Abstract

We study the estimation of a high-dimensional approximate factor model in the presence
of both cross-sectional dependence and heteroskedasticity. The classical method of
principal components analysis (PCA) does not efficiently estimate the factor loadings or
common factors because it essentially treats the idiosyncratic errors as homoskedastic
and cross-sectionally uncorrelated. For efficient estimation it is essential to estimate
a large error covariance matrix. We assume the model to be conditionally sparse, and
propose two approaches to estimating the common factors and factor loadings; both
are based on maximizing a Gaussian quasi-likelihood and involve regularizing a large
sparse covariance matrix. In the first approach the factor loadings and the error covariance
are estimated separately, while in the second approach they are estimated jointly.
We carry out extensive asymptotic analysis; in particular, we develop the inferential
theory for the two-step estimation. Because the proposed approaches take
into account the large error covariance matrix, they produce more efficient estimators
than the classical PCA methods or methods based on a strict factor model.



Keywords: High dimensionality, unknown factors, principal components, sparse matrix,
conditional sparsity, thresholding, cross-sectional correlation, penalized maximum likelihood,
adaptive lasso, heteroskedasticity
1 Introduction



In many applications of economics, finance, and other scientific fields, researchers often 
face a large panel data set in which there are multiple observations for each individual; 
here individuals can be families, firms, countries, etc. Modern applications usually involve 
data-rich environments in which both the number of observations for each individual and 
the number of individuals are large. One useful method for summarizing information in a 
large dataset is the factor model:

$$y_{it} = \alpha_i + \lambda_{0i}' f_t + u_{it}, \quad i \le N,\ t \le T, \qquad (1.1)$$

where $\alpha_i$ is an individual effect, $\lambda_{0i}$ is an $r \times 1$ vector of factor loadings and $f_t$ is an $r \times 1$
vector of common factors; $u_{it}$ denotes the idiosyncratic component of the model. Note that
$y_{it}$ is the only observable random variable in this model. If we write $y_t = (y_{1t}, \ldots, y_{Nt})'$,
$\Lambda_0 = (\lambda_{01}, \ldots, \lambda_{0N})'$, $\alpha = (\alpha_1, \ldots, \alpha_N)'$ and $u_t = (u_{1t}, \ldots, u_{Nt})'$, then model (1.1) can be
equivalently written as

$$y_t = \alpha + \Lambda_0 f_t + u_t.$$

Because $y_{it}$ is the only observable in the model, both factors and loadings are treated
as parameters to estimate. As was shown by Chamberlain and Rothschild (1983), in many
applications of factor analysis, it is desirable to allow dependence among the error terms
$\{u_{it}\}_{i \le N, t \le T}$ not only serially but also cross-sectionally. This gives rise to the approximate
factor model, in which the $N \times N$ covariance matrix $\Sigma_{u0} = \mathrm{cov}(u_t)$ is not diagonal. In
addition, the diagonal entries may vary over a wide range. As a result, efficiently estimating
the factor model under both large $N$ and large $T$ is difficult: to take into account both
cross-sectional heteroskedasticity and dependence of $\{u_{it}\}_{i \le N, t \le T}$, it is essential to estimate
the large covariance $\Sigma_{u0}$. The latter is known to be a challenging problem when $N$ is
larger than $T$.

In this paper, we assume the model to be conditionally sparse, in the sense that $\Sigma_{u0}$ is
a sparse matrix with bounded eigenvalues. This assumption effectively reduces the number
of parameters to be estimated in the model, and allows a consistent estimation of $\Sigma_{u0}$. The
latter is needed to efficiently estimate the factor loadings. In addition, it enables the model
to identify the common components $\alpha_i + \lambda_{0i}' f_t$ asymptotically as $N \to \infty$. We propose two
alternative methods, both likelihood-based. The first one is a two-step procedure. In
step one, we apply the principal orthogonal complement thresholding (POET) estimator of
Fan et al. (2012) to estimate $\Sigma_{u0}$ using the adaptive thresholding as in Cai and Liu (2011);
in step two, we estimate the factor loadings by maximizing a Gaussian quasi-likelihood function,
which depends on the covariance estimator in the first step. These two steps can be
carried out iteratively. We also propose an alternative method for jointly estimating the
factor loadings and the error covariance matrix by maximizing a weighted $\ell_1$-penalized likelihood
function. The likelihood penalizes the estimation of the off-diagonal entries of the error
covariance and automatically produces a sparse covariance estimator. We present asymptotic
analysis for both methods. In particular, we derive the uniform rate of convergence
and limiting distribution of the estimators for the two-step procedure. The analysis of the
joint estimation is more difficult as it involves penalizing a large covariance with diverging
eigenvalues. We establish the consistency for this method.

Moreover, we achieve the "sparsistency" for the estimated error covariance matrix in fac- 
tor analysis (see Section 3 for detailed explanations). The estimated covariance is consistent 
for both approaches under the normalized Frobenius norm even when N is much larger than 
T. This is important in the applications of approximate factor models. 

There has been a large literature on estimating the approximate factor model. Stock and
Watson (1998, 2002) and Bai (2003) considered the principal components analysis (PCA),
and they developed large-sample inferential theory. However, the PCA essentially treats $u_{it}$
as having the same variance across $i$, and hence is inefficient when cross-sectional heteroskedasticity
is present. Choi (2012) proposed a generalized PCA that requires $N < T$ to invert the error
sample covariance matrix. More recently, Bai and Li (2012) estimated the factor loadings
by maximizing the Gaussian quasi-likelihood, which addresses the heteroskedasticity under
large $N$, but they consider the strict factor model in which $(u_{1t}, \ldots, u_{Nt})$ are uncorrelated.
Additional literature on factor analysis includes, e.g., Bai and Ng (2002), Wang (2009), Dias,
Pinheiro and Rua (2008), Breitung and Tenhofen (2011), Han (2012), etc.; most of these
studies are based on the PCA method. In contrast, our methods are maximum-likelihood-based.
Maximum likelihood methods have been one of the fundamental tools for statistical
estimation and inference.

Our approach is closely related to the large covariance estimation literature, which has 
been rapidly growing in recent years. There are in general two ways to estimate a sparse co- 
variance in the literature: thresholding and penalized maximum likelihood. For our two-step 
procedure, we apply the POET estimator recently proposed by Fan et al. (2012), corre- 
sponding to the thresholding approach of Bickel and Levina (2008a), Rothman et al. (2009) 
and Cai and Liu (2011). For the joint estimation procedure, we use the penalized likelihood, 
corresponding to that of Lam and Fan (2009), Bien and Tibshirani (2011), etc. In either 
way, we need to show that the impact of estimating the large covariances is asymptotically 
negligible for an efficient estimation, which is not easy in our context since the likelihood 
function is highly nonlinear, and $\Lambda_0 \Lambda_0'$ contains a few eigenvalues that grow very fast. It was
recently shown by Fan et al. (2012) that estimating a covariance matrix with fast diverging
eigenvalues is a challenging problem. Other works on large covariance estimation include Cai 
and Zhou (2012), Fan et al. (2008), Jung and Marron (2009), Witten, Tibshirani and Hastie 
(2009), Deng and Tsui (2010), Yuan (2010), Ledoit and Wolf (2012), El Karoui (2008), Pati 
et al. (2012), Rohde and Tsybakov (2011), Zhou et al. (2011), Ravikumar et al. (2011) etc. 

This paper focuses on high-dimensional static factor models although the factors and 
errors can be serially correlated. The model considered is different from the generalized 
dynamic factor models as in Forni, Hallin, Lippi and Reichlin (2000), Forni and Lippi (2001), 
Hallin and Liska (2007), and other references therein. Both static and dynamic factor models 
are receiving increasing attention in applications of many fields. 

The paper is organized as follows. Section 2 introduces the conditional sparsity assump- 
tion and the likelihood function. Section 3 proposes the two-step estimation procedure. In 
particular, we present asymptotic inferential theory of the estimators. Both uniform rate 
of convergence and limiting distributions are derived. Section 4 gives the joint estimation 
as an alternative procedure, where we demonstrate the estimation consistency. Section 5 
illustrates some numerical examples which compare the proposed methods with the existing 
ones in the literature. Finally, Section 6 concludes with further discussions. All proofs are 
given in the appendix. 

Notation 

Let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote the maximum and minimum eigenvalues of a matrix $A$,
respectively. Also let $\|A\|_1$, $\|A\|$ and $\|A\|_F$ denote the $\ell_1$, spectral and Frobenius norms of
$A$, respectively. They are defined as $\|A\|_1 = \max_i \sum_j |A_{ij}|$, $\|A\| = \sqrt{\lambda_{\max}(A'A)}$, $\|A\|_F = \sqrt{\mathrm{tr}(A'A)}$.
Note that $\|A\|$ is also the Euclidean norm when $A$ is a vector. For two sequences
$a_T$ and $b_T$, we write $a_T \ll b_T$, and equivalently $b_T \gg a_T$, if $a_T = o(b_T)$ as $T \to \infty$.
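As a quick illustration of the three norms just defined, the following is a minimal NumPy sketch. The function names are hypothetical and chosen here only for illustration; the $\ell_1$ norm is taken to be the maximum absolute row sum.

```python
import numpy as np

def norm_l1(A):
    # Maximum absolute row sum: max_i sum_j |A_ij|
    return np.abs(A).sum(axis=1).max()

def norm_spectral(A):
    # Square root of the largest eigenvalue of A'A
    return np.sqrt(np.linalg.eigvalsh(A.T @ A).max())

def norm_frobenius(A):
    # Square root of tr(A'A)
    return np.sqrt(np.trace(A.T @ A))

A = np.array([[2.0, -1.0],
              [0.0,  3.0]])
print(norm_l1(A), norm_spectral(A), norm_frobenius(A))
```

The spectral norm never exceeds the Frobenius norm, which is one reason consistency under the spectral norm is the weaker and more easily attained requirement in high dimensions.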

2 Approximate Factor Models 

2.1 The model 

The approximate factor model (1.1) implies the following covariance decomposition:

$$\Sigma_{y0} = \Lambda_0 \,\mathrm{cov}(f_t)\, \Lambda_0' + \Sigma_{u0}, \qquad (2.1)$$

assuming $f_t$ to be uncorrelated with $u_t$, where $\Sigma_{y0}$ and $\Sigma_{u0}$ denote the $N \times N$ covariance
matrices of $y_t$ and $u_t$; $\mathrm{cov}(f_t)$ denotes the $r \times r$ covariance of $f_t$, all assumed to be
time-invariant. The approximate factor model typically requires the idiosyncratic covariance $\Sigma_{u0}$
to have bounded eigenvalues and $\Lambda_0' \Lambda_0$ to have eigenvalues diverging at rate $O(N)$. One of the
key concepts of approximate factor models is that it allows $\Sigma_{u0}$ to be non-diagonal.
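The eigenvalue structure implied by decomposition (2.1) can be checked numerically. The sketch below is only an illustration under assumptions made here: loadings are drawn as standard normals, $\mathrm{cov}(f_t) = I_r$, and $\Sigma_{u0}$ is a tridiagonal (hence sparse and non-diagonal) matrix. The function name `model_covariance` is hypothetical.

```python
import numpy as np

def model_covariance(N, r, seed=0):
    rng = np.random.default_rng(seed)
    Lam = rng.standard_normal((N, r))        # illustrative loadings
    # Banded idiosyncratic covariance with eigenvalues inside (0.4, 1.6)
    Sigma_u = np.eye(N) + 0.3 * (np.eye(N, k=1) + np.eye(N, k=-1))
    Sigma_y = Lam @ Lam.T + Sigma_u          # (2.1) with cov(f_t) = I_r
    return Sigma_y, Sigma_u

r = 3
top, rest = {}, {}
for N in (50, 200, 800):
    Sigma_y, Sigma_u = model_covariance(N, r)
    eig = np.linalg.eigvalsh(Sigma_y)[::-1]  # decreasing order
    top[N], rest[N] = eig[r - 1], eig[r]     # r-th and (r+1)-th eigenvalues
    print(N, np.round(eig[:r], 1), np.round(eig[r], 3))
```

The first $r$ eigenvalues grow with $N$, while the $(r+1)$-th stays below $\lambda_{\max}(\Sigma_{u0})$, which follows from Weyl's inequality because $\Lambda_0 \Lambda_0'$ has rank $r$.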

Stock and Watson (1998) and Bai (2003) derived the rates of convergence as well as the
inferential theory of the method of principal component analysis (PCA) for estimating the
factors and loadings. Let $Y = (y_1, \ldots, y_T)'$ be the $T \times N$ data matrix. Then PCA estimates
the $T \times r$ factor matrix $F$ by maximizing $\mathrm{tr}(F'(YY')F)$ subject to normalization restrictions
for $F$. The PCA method essentially restricts the idiosyncratic errors to be cross-sectionally
homoskedastic and independent. Thus it is known to be inefficient when the idiosyncratic errors are either
cross-sectionally heteroskedastic or correlated.
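The PCA estimator described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the common normalization $F'F/T = I_r$ (the text only says "normalization restrictions"), demeans the data to absorb the individual effects, and uses a hypothetical function name `pca_factors`.

```python
import numpy as np

def pca_factors(Y, r):
    """Y is T x N. Returns F (T x r) with F'F/T = I_r and loadings (N x r)."""
    T, N = Y.shape
    Yc = Y - Y.mean(axis=0)                      # remove individual effects
    _, eigvec = np.linalg.eigh(Yc @ Yc.T)        # eigenvalues in ascending order
    F = np.sqrt(T) * eigvec[:, ::-1][:, :r]      # top-r eigenvectors, scaled
    Lam = Yc.T @ F / T                           # least-squares loadings
    return F, Lam

# Simulated example (all values are illustrative assumptions)
rng = np.random.default_rng(0)
T, N, r = 200, 50, 2
F0 = rng.standard_normal((T, r))
L0 = rng.standard_normal((N, r))
Y = F0 @ L0.T + 0.5 * rng.standard_normal((T, N))

F_hat, Lam_hat = pca_factors(Y, r)
common_true = (F0 - F0.mean(axis=0)) @ L0.T
rel_err = np.linalg.norm(F_hat @ Lam_hat.T - common_true) / np.linalg.norm(common_true)
print(rel_err)
```

Only the common component $F\Lambda'$ is compared with the truth, because the factors and loadings themselves are identified only up to an invertible transformation.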

This paper aims at the efficient estimation of the approximate factor model, and assumes 
the number of factors r to be known. In practice, r can be estimated from the data, and 
there has been a large literature addressing its consistent estimation, e.g., Bai and Ng (2002), 
Kapetanios (2010), Onatski (2010), Alessi et al. (2010), Hallin and Liska (2007), Lam and 
Yao (2012), among others. 

2.2 Conditional sparsity 

An efficient estimation of the factor loadings and factors should take into account both
cross-sectional dependence and heteroskedasticity, which then involves estimating $\Sigma_{u0} = \mathrm{cov}(u_t)$,
or more precisely, the precision matrix $\Sigma_{u0}^{-1}$. In a data-rich environment, $N$ can be
either comparable with or much larger than $T$. Estimating $\Sigma_{u0}$ is then a challenging problem
even when the idiosyncratic components $\{u_{it}\}_{i \le N, t \le T}$ are observable, because the sample covariance is
singular when $N > T$ and its spectrum is inconsistent (Johnstone and Ma 2009).

Under the regular approximate factor model considered by Chamberlain and Rothschild
(1983) and Stock and Watson (2002), it is difficult to estimate $\Sigma_{u0}$ without further structural
assumptions. A natural assumption to go one step further is that of sparsity, which assumes
that many off-diagonal elements of $\Sigma_{u0}$ are either zero or vanishing as the dimensionality
increases. In an approximate factor model, it is more appropriate to assume that $\Sigma_{u0}$ is a sparse
matrix instead of $\Sigma_{y0}$. Due to the presence of common factors, we call a factor model with this
special structure conditionally sparse.

Therefore, the model studied in the current paper is the approximate factor model with
conditional sparsity (sparsity structure on $\Sigma_{u0}$), which is slightly more restrictive than that
of Chamberlain and Rothschild (1983). The conditional sparsity is required to regularize a
large idiosyncratic covariance, which allows us to take both cross-sectional correlation and
heteroskedasticity into account, and is needed for an efficient estimation. However, such
an assumption is still quite general and covers most of the applications of factor models in
economics, finance, genomics, and many other important applied areas.
2.3 Maximum likelihood



Compared to PCA, a more efficient estimation for model (2.1) of high dimension is based
on a Gaussian quasi-likelihood approach. Let $\bar f = T^{-1} \sum_{t=1}^T f_t$. Because of the existence of
$\alpha$, the model $y_t = \Lambda f_t + \alpha + u_t$ is observationally equivalent to $y_t = \Lambda f_t^* + \alpha^* + u_t$, where
$f_t^* = f_t - \bar f$ and $\alpha^* = \alpha + \Lambda \bar f$. Therefore without loss of generality, we assume $\bar f = 0$. The
Gaussian quasi-likelihood for $\Sigma_y$ is given by

$$-\frac{1}{N} \log|\det(\Sigma_y)| - \frac{1}{N} \mathrm{tr}(S_y \Sigma_y^{-1}),$$

where $S_y = T^{-1} \sum_{t=1}^T (y_t - \bar y)(y_t - \bar y)'$ is the sample covariance matrix, with $\bar y = T^{-1} \sum_{t=1}^T y_t$.
Plugging in (2.1), using the notation $S_f = \frac{1}{T} \sum_{t=1}^T f_t f_t'$, we obtain the quasi-likelihood function
for the factors and loadings:

$$-\frac{1}{N} \log|\det(\Lambda S_f \Lambda' + \Sigma_u)| - \frac{1}{N} \mathrm{tr}\left(S_y (\Lambda S_f \Lambda' + \Sigma_u)^{-1}\right), \qquad (2.2)$$

where $\Lambda = (\lambda_1, \ldots, \lambda_N)'$ is an $N \times r$ matrix of factor loadings.

It is well known that the factors and loadings are not separately identified without
further restrictions. Note that the factors and loadings enter the likelihood through $\Lambda S_f \Lambda'$.
Hence for any invertible $r \times r$ matrix $H$, if we define $\Lambda^* = \Lambda H^{-1}$, $f_t^* = H f_t$ and
$S_f^* = \frac{1}{T} \sum_{t=1}^T f_t^* f_t^{*\prime}$, then $\Lambda^* S_f^* \Lambda^{*\prime} = \Lambda S_f \Lambda'$, and they produce observationally equivalent models.
In this paper, we focus on a usual restriction for MLE of factor analysis (see e.g., Lawley
and Maxwell 1971) as follows:

$$S_f = I_r, \quad \text{and } \Lambda' \Sigma_u^{-1} \Lambda \text{ is diagonal}, \qquad (2.3)$$

and the diagonal entries of $\Lambda' \Sigma_u^{-1} \Lambda$ are distinct and are arranged in decreasing order.
Restriction (2.3) guarantees a unique solution to the maximization of the log-likelihood
function up to a column sign change for $\Lambda$. Therefore we assume the estimator $\hat\Lambda$ and $\Lambda_0$
have the same column signs, as part of the identification conditions.
The negative log-likelihood function (2.2) simplifies to

$$L(\Lambda, \Sigma_u) = \frac{1}{N} \log|\det(\Lambda \Lambda' + \Sigma_u)| + \frac{1}{N} \mathrm{tr}\left(S_y (\Lambda \Lambda' + \Sigma_u)^{-1}\right). \qquad (2.4)$$
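The objective (2.4) and the identification issue discussed above can be made concrete with a short NumPy sketch. The function name `neg_quasi_loglik` and the simulated inputs are assumptions made for illustration. With $S_f = I_r$, the remaining indeterminacy is an orthogonal rotation of $\Lambda$, which leaves $\Lambda\Lambda'$ and hence the objective unchanged; the diagonality requirement in (2.3) is what removes it.

```python
import numpy as np

def neg_quasi_loglik(Lam, Sigma_u, S_y):
    """Negative Gaussian quasi-log-likelihood (2.4), scaled by 1/N."""
    N = S_y.shape[0]
    Sigma_y = Lam @ Lam.T + Sigma_u
    _, logdet = np.linalg.slogdet(Sigma_y)           # log|det(Sigma_y)|
    trace_term = np.trace(np.linalg.solve(Sigma_y, S_y))
    return (logdet + trace_term) / N

# Illustrative inputs (not from the paper)
rng = np.random.default_rng(0)
N, r = 6, 2
Lam = rng.standard_normal((N, r))
Sigma_u = np.diag(rng.uniform(0.5, 2.0, N))
S_y = Lam @ Lam.T + Sigma_u                           # pretend S_y equals the truth

Q, _ = np.linalg.qr(rng.standard_normal((r, r)))      # random orthogonal rotation
val = neg_quasi_loglik(Lam, Sigma_u, S_y)
val_rot = neg_quasi_loglik(Lam @ Q, Sigma_u, S_y)
print(val, val_rot)
```

Using `slogdet` and `solve` avoids forming the determinant and the inverse explicitly, which matters numerically because $\Lambda\Lambda'$ has a few very large eigenvalues.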

In the presence of cross-sectional dependence, $\Sigma_{u0}$ is not necessarily diagonal. Therefore
there can be up to $O(N^2)$ free parameters in the likelihood function (2.4). There are in
general two main regularization approaches to estimating a large sparse covariance: (adaptive)
thresholding (Bickel and Levina 2008a, Rothman et al. 2009, Cai and Liu 2011, etc.)
and penalized maximum likelihood (Lam and Fan 2009, Bien and Tibshirani 2011). Correspondingly
in this paper, we propose two methods for regularizing the likelihood function to
efficiently estimate the factor loadings as well as the unknown factors. One estimates $\Sigma_{u0}$
and $\Lambda_0$ in two steps and the other estimates them jointly.

3 Two-Step Estimation 

The two-step estimation estimates $(\Lambda_0, \Sigma_{u0})$ separately. In the first step, we estimate
$\Sigma_{u0}$ by the principal orthogonal complement thresholding (POET), proposed by Fan et al.
(2012), and in the second step we estimate $\Lambda_0$ only, using the quasi-maximum likelihood,
replacing $\Sigma_u$ by the covariance estimator obtained in step one.

3.1 Step one: covariance estimation by thresholding 

The POET is based on a spectral expansion of the sample covariance matrix and adaptive
thresholding. Let $(\nu_j, \xi_j)_{j=1}^N$ be the eigenvalue-eigenvector pairs of the sample covariance $S_y$ of
$y_t$, in decreasing order such that $\nu_1 \ge \nu_2 \ge \cdots \ge \nu_N$. Then $S_y$ has the following spectral
decomposition:

$$S_y = \sum_{i=1}^r \nu_i \xi_i \xi_i' + R,$$

where $R = \sum_{i=r+1}^N \nu_i \xi_i \xi_i'$ is the orthogonal complement component. Define a general thresholding
function $s_{ij}(z): \mathbb{R} \to \mathbb{R}$ as in Rothman et al. (2009) and Cai and Liu (2011) with an
entry-dependent threshold $\tau_{ij}$ such that:

(i) $s_{ij}(z) = 0$ if $|z| < \tau_{ij}$;

(ii) $|s_{ij}(z) - z| \le \tau_{ij}$;

(iii) there are constants $a > 0$ and $b > 1$ such that $|s_{ij}(z) - z| \le a\tau_{ij}^2$ if $|z| > b\tau_{ij}$.

Examples of $s_{ij}(z)$ include the hard-thresholding $s_{ij}(z) = z I(|z| > \tau_{ij})$, SCAD (Fan and Li
2001), MCP (Zhang 2010), etc. Then we obtain the step-one consistent estimator for $\Sigma_{u0}$:

$$\hat\Sigma_u^{(1)} = (s_{ij}(R_{ij}))_{N \times N}, \quad \text{where } R = (R_{ij})_{N \times N}.$$

We can choose the threshold as $\tau_{ij} = C\sqrt{R_{ii}R_{jj}}\left(\sqrt{(\log N)/T} + 1/\sqrt{N}\right)$ for some universal
constant $C > 0$, which corresponds to applying the threshold $C\left(\sqrt{(\log N)/T} + 1/\sqrt{N}\right)$ to
the correlation matrix of $R$ [defined to be $\mathrm{diag}(R)^{-1/2} R\, \mathrm{diag}(R)^{-1/2}$]. The POET estimator
also has an equivalent expression using PCA. Let $\{\hat u_{it}^{PCA}\}_{i \le N, t \le T}$ denote the PCA estimators
of $\{u_{it}\}_{i \le N, t \le T}$ (Bai 2003). Then $\hat\Sigma_{u,ij}^{(1)} = s_{ij}\left(T^{-1} \sum_{t=1}^T \hat u_{it}^{PCA} \hat u_{jt}^{PCA}\right)$.
It was shown by Fan et al. (2012) that under some regularity conditions $\|\hat\Sigma_u^{(1)} - \Sigma_{u0}\| = O_p(N^{-1/2} + T^{-1/2}(\log N)^{1/2})$,
which guarantees the positive definiteness asymptotically, given
that $\lambda_{\min}(\Sigma_{u0})$ is bounded away from zero.
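Step one can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions made here rather than the authors' implementation: it uses hard thresholding with the correlation-type threshold $\tau_{ij} = C\sqrt{R_{ii}R_{jj}}(\sqrt{(\log N)/T} + 1/\sqrt{N})$, leaves the diagonal of $R$ unthresholded, and takes $C = 0.5$ as an arbitrary default. The function name `poet_hard` is hypothetical.

```python
import numpy as np

def poet_hard(Y, r, C=0.5):
    """Y is T x N. Returns (thresholded covariance, orthogonal complement R)."""
    T, N = Y.shape
    Yc = Y - Y.mean(axis=0)
    S_y = Yc.T @ Yc / T                                # sample covariance
    nu, xi = np.linalg.eigh(S_y)                       # ascending order
    nu, xi = nu[::-1], xi[:, ::-1]                     # switch to decreasing order
    R = (xi[:, r:] * nu[r:]) @ xi[:, r:].T             # sum_{i>r} nu_i xi_i xi_i'
    R = (R + R.T) / 2                                  # enforce exact symmetry
    d = np.sqrt(np.clip(np.diag(R), 0.0, None))
    omega = np.sqrt(np.log(N) / T) + 1.0 / np.sqrt(N)
    tau = C * np.outer(d, d) * omega                   # entry-dependent threshold
    Sigma_u = np.where(np.abs(R) > tau, R, 0.0)        # hard thresholding
    np.fill_diagonal(Sigma_u, np.diag(R))              # assumption: keep diagonal
    return Sigma_u, R

# Illustrative data with two factors
rng = np.random.default_rng(0)
T, N = 100, 30
Y = rng.standard_normal((T, 2)) @ rng.standard_normal((2, N)) + rng.standard_normal((T, N))
Sigma_u_hat, R = poet_hard(Y, r=2)
print((Sigma_u_hat != 0).mean())                       # share of nonzero entries
```

Replacing the `np.where` line with a SCAD or soft-thresholding rule gives other members of the class of thresholding functions described above.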

3.2 Step two: estimating factor loadings and factors 

Replacing $\Sigma_u$ in (2.4) by $\hat\Sigma_u^{(1)}$, we obtain the objective function for $\Lambda$. Under the
identification condition (2.3), in this step, we estimate the loadings as:

$$\hat\Lambda^{(1)} = \arg\min_{\Lambda \in \Theta_\lambda} L_1(\Lambda)
= \arg\min_{\Lambda \in \Theta_\lambda} \frac{1}{N} \log\left|\det(\Lambda\Lambda' + \hat\Sigma_u^{(1)})\right| + \frac{1}{N} \mathrm{tr}\left(S_y(\Lambda\Lambda' + \hat\Sigma_u^{(1)})^{-1}\right), \qquad (3.1)$$

where $\Theta_\lambda$ is a parameter space for the loading matrix, to be defined later. Suppose that
$y_t \sim N(0, \Lambda_0\Lambda_0' + \Sigma_{u0})$; the negative log-likelihood is then the same (up to a constant) as
(3.1) except that $\hat\Sigma_u^{(1)}$ should be replaced by $\Sigma_{u0}$. Consequently, (3.1) can be treated as a
Gaussian quasi-likelihood of $\Lambda$, which gives an efficient estimation of $\Lambda$ since it takes into
account the cross-sectional heteroskedasticity and dependence in $\Sigma_{u0}$ through its consistent
estimator.

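After obtaining $\hat\Lambda^{(1)}$, we estimate $f_t$ via the generalized least squares (GLS) as suggested
by Bai and Li (2012):

$$\hat f_t^{(1)} = \left(\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}\right)^{-1} \hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}(y_t - \bar y).$$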
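The GLS formula for the factors translates directly into code. The sketch below is an illustration with a hypothetical function name `gls_factors`; it takes any loading estimate and any positive definite covariance estimate as inputs and is checked here on noise-free simulated data, where the true factors should be recovered exactly.

```python
import numpy as np

def gls_factors(Y, Lam, Sigma_u):
    """Y is T x N, Lam is N x r, Sigma_u is N x N. Returns the T x r GLS factors."""
    Yc = Y - Y.mean(axis=0)                    # rows are (y_t - ybar)'
    W = np.linalg.solve(Sigma_u, Lam)          # Sigma_u^{-1} Lam
    A = Lam.T @ W                              # Lam' Sigma_u^{-1} Lam
    return np.linalg.solve(A, W.T @ Yc.T).T    # (A^{-1} Lam' Sigma_u^{-1} Yc')'

# Noise-free check with illustrative inputs
rng = np.random.default_rng(1)
T, N, r = 50, 8, 2
F0 = rng.standard_normal((T, r))
F0 -= F0.mean(axis=0)                          # impose a zero sample mean on the factors
L0 = rng.standard_normal((N, r))
Sigma_u = np.diag(rng.uniform(0.5, 2.0, N))
Y = F0 @ L0.T + 3.0                            # constant individual effect, no noise
F_hat = gls_factors(Y, L0, Sigma_u)
print(np.abs(F_hat - F0).max())
```

Compared with ordinary least squares on the loadings, the weighting by $\Sigma_u^{-1}$ downweights noisy or strongly correlated cross-sectional units, which is the source of the efficiency gain.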

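The proposed two-step procedure can be carried out iteratively. After obtaining
$(\hat\Lambda^{(1)}, \hat f_t^{(1)})$, we update

$$\hat u_t = y_t - \hat\Lambda^{(1)} \hat f_t^{(1)}, \qquad \hat\Sigma_u^{(2)} = \left(s_{ij}\left(T^{-1} \sum_{t=1}^T \hat u_{it} \hat u_{jt}\right)\right)_{N \times N}.$$

Then $\hat\Sigma_u$ in the objective function (3.1) is updated, which gives updated $\hat\Lambda^{(2)}$ and $\hat f_t^{(2)}$
respectively. This procedure can be continued until convergence.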

3.3 Positive definiteness 

The objective function (3.1) requires $\Lambda\Lambda' + \hat\Sigma_u^{(1)}$ to be positive definite for any given finite
sample. A sufficient condition is the finite-sample positive definiteness of $\hat\Sigma_u^{(1)}$, which also
depends on the choice of the adaptive threshold value $\tau_{ij}$. We specify

$$\tau_{ij} = C\alpha_{ij}\left(\sqrt{(\log N)/T} + 1/\sqrt{N}\right),$$

where $\alpha_{ij}$ is an entry-dependent value that captures the variability of individual variables
such as $\sqrt{R_{ii}R_{jj}}$; $C > 0$ is a pre-determined universal constant. More concretely, the finite
sample positive definiteness depends on the choice of $C$. If we write $\hat\Sigma_u^{(1)} = \hat\Sigma_u^{(1)}(C)$ in step
one to indicate its dependence on the threshold, then $C$ should be chosen in the interval
$(C_{\min}, C_{\max}]$, where

$$C_{\min} = \inf\{M : \lambda_{\min}(\hat\Sigma_u^{(1)}(C)) > 0, \ \forall C > M\},$$

and $C_{\max}$ is a large constant that thresholds all the off-diagonal elements of $\hat\Sigma_u^{(1)}$ to zero.
Then by construction, $\hat\Sigma_u^{(1)}(C)$ is finite-sample positive definite for any $C > C_{\min}$ (see Figure 1).

Figure 1: Minimum eigenvalue $\lambda_{\min}(\hat\Sigma_u^{(1)}(C))$.
Data are simulated from the setting of Section 5 with T = 100, N = 150. Both hard and SCAD
with adaptive thresholds (Cai and Liu 2011) are plotted.
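In practice $C_{\min}$ can be approximated on a grid. The sketch below is one possible way to do this under assumptions made here: hard thresholding with $\alpha_{ij} = \sqrt{R_{ii}R_{jj}}$, an unthresholded diagonal, a small numerical tolerance in place of the strict inequality $\lambda_{\min} > 0$, and hypothetical function names. The matrix `R` is a stand-in for the orthogonal complement component from step one.

```python
import numpy as np

def threshold_cov(R, C, omega):
    # Hard thresholding with tau_ij = C * sqrt(R_ii R_jj) * omega
    d = np.sqrt(np.diag(R))
    tau = C * np.outer(d, d) * omega
    S = np.where(np.abs(R) > tau, R, 0.0)
    np.fill_diagonal(S, np.diag(R))
    return S

def c_min_on_grid(R, omega, grid, tol=1e-8):
    """Smallest grid value M such that the thresholded matrix is positive
    definite for every grid value C >= M (None if even the largest C fails)."""
    c_min = None
    for C in np.sort(np.asarray(grid))[::-1]:          # walk down from the top
        if np.linalg.eigvalsh(threshold_cov(R, C, omega)).min() <= tol:
            break
        c_min = C
    return c_min

# Illustrative singular input: N > T, so R itself is not positive definite
rng = np.random.default_rng(0)
T, N = 20, 30
U = rng.standard_normal((T, N))
R = U.T @ U / T
omega = np.sqrt(np.log(N) / T) + 1.0 / np.sqrt(N)
c_min = c_min_on_grid(R, omega, np.linspace(0.0, 2.0, 41))
print(c_min)
```

Any $C$ slightly above the returned value keeps the estimator positive definite on this grid while thresholding as little as possible, which is the trade-off that the interval $(C_{\min}, C_{\max}]$ expresses.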



3.4 Asymptotic analysis 

We now present the asymptotic analysis of the proposed two-step estimator. We first 
list a set of regularity conditions and then present the consistency. A more refined set of 
assumptions are needed to achieve the optimal rate of convergence as well as the limiting 
distributions. 
3.4.1 Consistency

Assumption 3.1. Let $\Sigma_{u0,ij}$ denote the $(i,j)$th entry of $\Sigma_{u0}$. There is $q \in [0, 1)$ such that

$$m_N = \max_{i \le N} \sum_{j=1}^N |\Sigma_{u0,ij}|^q = o\left(\min\left(\sqrt{N}, \sqrt{T/\log N}\right)\right).$$

In particular, when $q = 0$, we define $m_N = \max_{i \le N} \sum_{j=1}^N I(\Sigma_{u0,ij} \ne 0)$, which corresponds to the
"exactly sparse" case.

The first assumption sets a condition on the sparsity of $\Sigma_{u0}$, under which Fan et al.
(2012) showed that the POET estimator $\hat\Sigma_u^{(1)}$ is consistent under the operator norm. The
sparsity is in terms of the maximum row sum, considered by Bickel and Levina (2008a).
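The sparsity measure $m_N$ of Assumption 3.1 is straightforward to compute for a given covariance matrix. The following sketch uses a hypothetical function name and a tridiagonal example chosen for illustration.

```python
import numpy as np

def sparsity_measure(Sigma, q):
    """m_N = max_i sum_j |Sigma_ij|^q, with the indicator version when q = 0."""
    A = np.abs(Sigma)
    if q == 0:
        return (A > 0).sum(axis=1).max()       # maximum number of nonzeros per row
    return (A ** q).sum(axis=1).max()

# Tridiagonal example: unit variances, 0.3 on the first off-diagonals
N = 6
Sigma = np.eye(N) + 0.3 * (np.eye(N, k=1) + np.eye(N, k=-1))
m0 = sparsity_measure(Sigma, 0)
m_half = sparsity_measure(Sigma, 0.5)
print(m0, m_half)
```

For a banded matrix with fixed bandwidth, $m_N$ stays bounded as $N$ grows, so the rate condition in Assumption 3.1 holds trivially.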

The following assumption provides the regularity conditions on the data generating process.
We introduce the strong mixing condition. Let $\mathcal{F}_{-\infty}^0$ and $\mathcal{F}_T^\infty$ denote the $\sigma$-algebras
generated by $\{(f_t, u_t) : -\infty \le t \le 0\}$ and $\{(f_t, u_t) : T \le t \le \infty\}$ respectively. In addition,
define the mixing coefficient

$$\alpha(T) = \sup_{A \in \mathcal{F}_{-\infty}^0,\, B \in \mathcal{F}_T^\infty} |P(A)P(B) - P(AB)|. \qquad (3.2)$$

Assumption 3.2. (i) $\{u_t, f_t\}_{t \ge 1}$ is strictly stationary. In addition, $Eu_{it} = Eu_{it}f_{jt} = 0$ for
all $i \le N$, $j \le r$ and $t \le T$.

(ii) There exist constants $c_1, c_2 > 0$ such that $c_2 < \lambda_{\min}(\Sigma_{u0}) \le \lambda_{\max}(\Sigma_{u0}) < c_1$, and
$\max_{j \le N} \|\lambda_{0j}\| < c_1$.

(iii) Exponential tail: There exist $r_1, r_2 > 0$ and $b_1, b_2 > 0$, such that for any $s > 0$, $i \le N$
and $j \le r$,

$$P(|u_{it}| > s) \le \exp(-(s/b_1)^{r_1}), \qquad P(|f_{jt}| > s) \le \exp(-(s/b_2)^{r_2}).$$

(iv) Strong mixing: There exists $r_3 > 0$ such that $3r_1^{-1} + 1.5r_2^{-1} + r_3^{-1} > 1$, and $C > 0$
satisfying: for all $T \in \mathbb{Z}^+$,

$$\alpha(T) \le \exp(-CT^{r_3}).$$

The following assumptions are standard in approximate factor models; see, e.g., Stock
and Watson (1998, 2002) and Bai (2003). In particular, Assumption 3.3 implies that the
first $r$ eigenvalues of $\Lambda_0\Lambda_0'$ grow rapidly at rate $O(N)$. Intuitively, it requires the factors
to be pervasive, in the sense that they impact a non-vanishing proportion of the time series
$\{y_{1t}\}_{t \le T}, \ldots, \{y_{Nt}\}_{t \le T}$.
Assumption 3.3. There is a $\delta > 0$ such that for all large $N$,

$$\delta^{-1} < \lambda_{\min}(N^{-1}\Lambda_0'\Lambda_0) \le \lambda_{\max}(N^{-1}\Lambda_0'\Lambda_0) < \delta.$$

Therefore all the eigenvalues of $N^{-1}\Lambda_0'\Lambda_0$ are bounded away from both zero and infinity as
$N \to \infty$.

Assumption 3.4. There exists $M > 0$ such that for all $t \le T$ and $s \le T$,

(i) $E[N^{-1/2}(u_s'u_t - Eu_s'u_t)]^4 < M$,
(ii) $E\|N^{-1/2}\sum_{i=1}^N \lambda_{0i}u_{it}\|^4 < M$.

The following assumption defines the threshold $\tau_{ij}$ on the $(i,j)$th entry of $R$ for the
step-one POET estimator.

Assumption 3.5. The threshold $\tau_{ij} = C\alpha_{ij}\left(\sqrt{(\log N)/T} + 1/\sqrt{T}\right)$ where $\alpha_{ij} > 0$ is
entry-dependent, either stochastic or deterministic, such that $\forall \epsilon > 0$, there are positive $C_1$ and $C_2$
so that

$$P\left(C_1 < \min_{i,j} \alpha_{ij} \le \max_{i,j} \alpha_{ij} < C_2\right) > 1 - \epsilon \qquad (3.3)$$

for all large $N$ and $T$. Here $C > 0$ is a deterministic constant.



Condition (3.3) requires the rate $\tau_{ij} \asymp \left(\sqrt{(\log N)/T} + 1/\sqrt{T}\right)$ uniformly in $(i,j)$. This
condition is satisfied by the universal threshold $\alpha_{ij} = \alpha$ for all $(i,j)$, the correlation threshold
$\alpha_{ij} = \sqrt{R_{ii}R_{jj}}$ as discussed before, and the adaptive threshold in Cai and Liu (2011).

For identification, we require the objective function to be minimized subject to the diagonality
of $\Lambda'(\hat\Sigma_u^{(1)})^{-1}\Lambda$. In addition, since Assumption 3.3 is essential in asymptotically identifying
the covariance decomposition $\Sigma_{y0} = \Lambda_0\Lambda_0' + \Sigma_{u0}$, we need to take it into account when minimizing
the objective function. Therefore we assume $\delta$ in Assumption 3.3 is sufficiently large,
which leads to the following parameter space:

$$\Theta_\lambda = \{\Lambda : \delta^{-1} < \lambda_{\min}(N^{-1}\Lambda'\Lambda) \le \lambda_{\max}(N^{-1}\Lambda'\Lambda) < \delta, \ \Lambda'(\hat\Sigma_u^{(1)})^{-1}\Lambda \text{ is diagonal}\}. \qquad (3.4)$$

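Write $\gamma^{-1} = 3r_1^{-1} + 1.5r_2^{-1} + r_3^{-1} + 1$ and $\hat\Lambda^{(1)} = (\hat\lambda_1^{(1)}, \ldots, \hat\lambda_N^{(1)})'$. We have the following
theorem.

Theorem 3.1. Suppose $(\log N)^{6/\gamma} = o(T)$, $T = o(N^2)$. Under Assumptions 3.1-3.5,

$$\frac{1}{\sqrt{N}}\|\hat\Lambda^{(1)} - \Lambda_0\|_F = o_p(1), \qquad \max_{j \le N} \|\hat\lambda_j^{(1)} - \lambda_{0j}\| = o_p(1).$$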
By a more careful large-sample analysis, we can improve the above result and derive the
rate of convergence. Throughout the paper, we will frequently use the notation:

$$\omega_T = \frac{1}{\sqrt{N}} + \sqrt{\frac{\log N}{T}}.$$

Theorem 3.2. Under the assumptions of Theorem 3.1,

$$\frac{1}{\sqrt{N}}\|\hat\Lambda^{(1)} - \Lambda_0\|_F = O_p(m_N\omega_T^{1-q}), \qquad \max_{j \le N} \|\hat\lambda_j^{(1)} - \lambda_{0j}\| = O_p(m_N\omega_T^{1-q}),$$

where $m_N$ and $q$ are defined in Assumption 3.1.

Remark 3.1. In the above theorem $m_N$ does not need to be bounded. But in order to
achieve the $\sqrt{T}$-consistency for each $\hat\lambda_j^{(1)}$, the uniform rate of convergence above would require
it to be bounded (which is a strong assumption on the sparsity of $\Sigma_{u0}$). Later in Section 3.4.3
we will enhance this convergence rate so that the boundedness of $m_N$ is not necessary and
$\sqrt{T}$-consistency can still be achieved. This will require additional regularity conditions.

3.4.2 Covariance estimation and sparsistency 

In order to obtain the limiting distribution for each individual $\hat\lambda_j^{(1)}$, we also need to
achieve the sparsistency for estimating $\Sigma_{u0}$. By sparsistency, we mean the property that all
small entries of $\Sigma_{u0}$ are estimated as exactly zero with a probability arbitrarily close to
one. Besides being important for deriving the limiting distribution of $\hat\lambda_j^{(1)}$, the sparsistency
itself is of independent interest for large covariance estimation, and has been studied by
many authors, for instance, Lam and Fan (2009) and Rothman et al. (2009). To the best of our
knowledge, this is the first place where the sparsistency for an estimated idiosyncratic $\Sigma_{u0}$
is achieved in a high-dimensional approximate factor model.

Let $S_L$ and $S_U$ denote two disjoint sets that respectively include the indices of small and
large elements of $\Sigma_{u0}$ in absolute value, and

$$\{(i,j) : i \le N, j \le N\} = S_L \cup S_U.$$

Because the diagonal elements represent the individual variances of the idiosyncratic components,
we assume $(i,i) \in S_U$ for all $i \le N$. The sparsity assumes that most of the indices
$(i,j)$ belong to $S_L$ when $i \ne j$. A special case arises when $\Sigma_{u0}$ is strictly sparse, in the sense
that its elements of small magnitude ($S_L$) are exactly zero. For the banded matrix as an
example,

$$\Sigma_{u0,ij} \ne 0 \ \text{if } |i - j| \le k; \qquad \Sigma_{u0,ij} = 0 \ \text{if } |i - j| > k$$

for some fixed $k$. Then $S_L = \{(i,j) : |i - j| > k\}$ and $S_U = \{(i,j) : |i - j| \le k\}$.

The following assumption quantifies the "small" and "big" entries of $\Sigma_{u0}$. By "small"
entries we mean those of smaller order than $\omega_T = N^{-1/2} + T^{-1/2}(\log N)^{1/2}$. The partition
$\{(i,j) : i \le N, j \le N\} = S_L \cup S_U$ may not be unique. Our analysis suffices as long as such
a partition exists.

Assumption 3.6. There is a partition $\{(i,j) : i \le N, j \le N\} = S_L \cup S_U$ such that $(i,i) \in S_U$
for all $i \le N$ and $S_L$ is nonempty. In addition,

$$\max_{(i,j) \in S_L} |\Sigma_{u0,ij}| \ll \omega_T \ll \min_{(i,j) \in S_U} |\Sigma_{u0,ij}|.$$

The conditional sparsity assumption requires most off-diagonal entries of $\Sigma_{u0}$ to be inside
$S_L$, hence it is reasonable to have $S_L \ne \emptyset$ in the condition. It is possible that $S_U$ only contains
the diagonal elements. It then essentially corresponds to the strict factor model where $\Sigma_{u0}$ is
almost a diagonal matrix and error terms are only weakly cross-sectionally correlated. That
is also a special case of Assumption 3.6.

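Theorem 3.3. Under Assumption 3.6 and those of Theorem 3.2, for any $\epsilon > 0$ and $M > 0$,
there is an integer $N_0 > 0$ such that as long as $T, N > N_0$,

$$P\left(\hat\Sigma_{u,ij}^{(1)} = 0, \ \forall (i,j) \in S_L\right) > 1 - \epsilon,$$

$$P\left(|\hat\Sigma_{u,ij}^{(1)}| > M\omega_T, \ \forall (i,j) \in S_U\right) > 1 - \epsilon.$$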

It was shown by Fan et al. (2012) that $\|(\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1}\| = O_p(m_N\omega_T^{1-q})$. Theorem 3.4
below demonstrates a strengthened convergence rate for the averaged estimation error.

Assumption 3.7. There is $c > 0$ such that $\|\Sigma_{u0}^{-1}\|_1 < c$.

In addition to Assumptions 3.1 and 3.6 we require the following condition on the sparsity
of $\Sigma_{u0}$, which further characterizes $S_L$ and $S_U$:

Assumption 3.8. The index sets $S_L$ and $S_U$ satisfy: $\sum_{i \ne j, (i,j) \in S_U} 1 = O(N)$ and
$\sum_{(i,j) \in S_L} |\Sigma_{u0,ij}| = O(1)$.

Assumption 3.8 requires that the number of off-diagonal large entries of $\Sigma_{u0}$ be of order
$O(N)$, and that the absolute sum of the small entries is bounded. This assumption is
satisfied, for example, if $\{u_{it}\}_{i \le N}$ follows a heteroskedastic MA($p$) process with a fixed
$p$, where $\sum_{i \ne j, (i,j) \in S_U} 1 = O(N)$ and $\sum_{(i,j) \in S_L} |\Sigma_{u0,ij}| = 0$. It is also satisfied by banded
matrices (Bickel and Levina 2008b, Cai and Yuan 2012) and block-diagonal matrices with
fixed block size.

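Define an $r \times N$ matrix $\Xi = \Lambda_0'\Sigma_{u0}^{-1} = (\xi_1, \ldots, \xi_N)$. Then $\|\Sigma_{u0}^{-1}\|_1 < c$ implies

$$\max_{j \le N} \|\xi_j\| = \max_{j \le N} \left\|\sum_{i=1}^N \lambda_{0i}(\Sigma_{u0}^{-1})_{ij}\right\| \le \|\Sigma_{u0}^{-1}\|_1 \max_{j \le N} \|\lambda_{0j}\| < \infty.$$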

The following assumption corresponds to those of PCA in Bai (2003), and also extends
to the non-diagonal $\Sigma_{u0}$.

Assumption 3.9. (i) E\\^YLi - Eu' s u t )\\ 2 = 0(1) 

(it) For each element d^ui of (k,l < r), 

j^Wf ^j=i £*Li YlJ=i( u itUjt - EuitUjtjXoiXo^ki = O p (l), 

7= Ef=i £L(4 - EuUti = O p (l). 

fnij For eac/i element d^^i of t;^, 

N\/nt ^2i^j,(i,j)eSu St=i ~ EuitU^XojX'^dij^i = O p (l), 



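Under Assumption 3.9 we can achieve the following improved rate of convergence for the
averaged estimation error of $\hat\Sigma_u^{(1)} - \Sigma_{u0}$.

Theorem 3.4. Under the assumptions of Theorem 3.3 and Assumption 3.9,

$$\frac{1}{N}\left\|\Lambda_0'\left[(\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1}\right]\Lambda_0\right\|_F = O_p(m_N^2\omega_T^{2-2q}).$$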

Remark 3.2. 1. A simple application of
$\|(\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1}\| = O_p(m_N\omega_T^{1-q})$ by Fan et al. (2012) yields
$\frac{1}{N}\|\Lambda_0'[(\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1}]\Lambda_0\|_F = O_p(m_N\omega_T^{1-q})$. In contrast, the rate we present in Theorem
3.4 requires more refined asymptotic analysis. It shows that after weighting by the factor
loadings, the averaged convergence rate is faster.

2. The condition on the large-entry set $S_U$ in Assumption 3.8 can be relaxed a bit to
$\sum_{i \ne j, (i,j) \in S_U} 1 = O(N^{1+\epsilon})$ for an arbitrarily small $\epsilon > 0$, which will allow less sparse
covariances. For example, suppose $\{u_{it}\}_{i \le N}$ follows a cross-sectional AR(1) process
such that

$$u_{it} = \rho u_{i-1,t} + e_{it}$$

for $|\rho| < 1$ and $\{e_{it}\}_{i \le N, t \le T}$ being independent across both $i$ and $t$. We can then find
a partition $S_L \cup S_U$ such that $\sum_{(i,j) \in S_L} |\Sigma_{u0,ij}| = O(1)$ and $\sum_{i \ne j, (i,j) \in S_U} 1 = O(N^{1+\epsilon})$
for any $\epsilon > 0$. Theorems 3.4 and 3.5 below still hold. But conditions in Assumption
3.9 need to be adjusted accordingly. For example, in condition (iii) the normalizing
constant $N\sqrt{NT}$ in the first equation should be changed to $N^{\epsilon+3/2}\sqrt{T}$, and that in the second
equation should be changed to $N^{(1+\epsilon)/2}\sqrt{T}$. The current Assumption 3.9, on the other
hand, keeps our presentation simple.

3.4.3 Limiting distribution 

As a result of Theorem 3.4, the impact of estimating $\Sigma_{u0}$ at step one is asymptotically
negligible. This enables us to achieve the $\sqrt{T}$-consistency and the limiting distribution of
$\hat\lambda_j^{(1)}$ for each $j$. We impose further assumptions.
Assumption 3.10. (i) ^= £f =1 Ef=i Eti( 

Eu it u jt )CiCj = O p (l). 
For each j < N,^L= Y7t=ii u it u jt ~ Eu it u jt )^ = O p (l). 

H Eii Eli tiUitfl = o p (i). 

Theorem 3.5. Suppose $0 \le q < 1/2$, and $T = o(N^{2-2q})$. In addition, $m_N^2\omega_T^{2-2q} = o(T^{-1/2})$.
Then under the assumptions of Theorem 3.4 and Assumption 3.10, for each $j \le N$,

$$\sqrt{T}(\hat\lambda_j^{(1)} - \lambda_{0j}) \to_d N_r\left(0, E(u_{jt}^2 f_t f_t')\right).$$

We make some technical remarks regarding Theorem 3.5.

Remark 3.3. 1. The condition $m_N^2\omega_T^{2-2q} = o(T^{-1/2})$ (roughly speaking, this is $m_N = o(T^{1/4})$
when $N$ is very large and $q = 0$) strengthens the sparsity condition of Assumption 3.7.
The required upper bound for $m_N$ is tight. Roughly speaking, the estimation
error of $\hat\Sigma_u^{(1)}$ plays a role in the asymptotic expansion of $\sqrt{T}(\hat\lambda_j^{(1)} - \lambda_{0j})$ only through
an averaged term as in Theorem 3.4. Condition $m_N^2\omega_T^{2-2q} = o(T^{-1/2})$ is required for
that term to be asymptotically negligible.

2. The asymptotic normality also holds jointly for finitely many estimators. For any finite
and fixed $k$, we have

$$\sqrt{T}(\hat\lambda_1^{(1)\prime} - \lambda_{01}', \ldots, \hat\lambda_k^{(1)\prime} - \lambda_{0k}')' \to^d N_{kr}(0, E[\mathrm{cov}(u_t^k|f_t) \otimes f_tf_t']),$$

where $\mathrm{cov}(u_t^k|f_t) = \mathrm{cov}(u_{1t}, \ldots, u_{kt}|f_t)$.

3. If Assumption 3.10(i) is replaced by a uniform convergence, by assuming
$\max_{j\le N}\|\frac{1}{\sqrt{NT}}\sum_{i=1}^N\sum_{t=1}^T(u_{it}u_{jt} - Eu_{it}u_{jt})\xi_i\| = O_p(\sqrt{\log N})$, we can then improve
the uniform rate of convergence in Theorem 3.2 and obtain

$$\max_{j\le N}\|\hat\lambda_j^{(1)} - \lambda_{0j}\| = O_p\Big(\sqrt{\frac{\log N}{T}}\Big).$$

3.4.4 Estimation of common factors 

For the limiting distribution of $\hat f_t^{(1)}$, we make the following additional assumption:

Assumption 3.11. There is a positive definite matrix Q such that for each t <T, 
1 1 1 N 

3=1 

For the next assumption, we define $\beta_t = \Sigma_{u0}^{-1}u_t$. Then $\beta_t$ has mean zero and covariance
matrix $\Sigma_{u0}^{-1}$.
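The moment claims for $\beta_t$ can be checked in one line. The sketch below assumes only what the model already imposes, namely $Eu_t = 0$ and $\mathrm{cov}(u_t) = \Sigma_{u0}$ with $\Sigma_{u0}$ symmetric and invertible:

```latex
E\beta_t = \Sigma_{u0}^{-1} E u_t = 0,
\qquad
\mathrm{cov}(\beta_t)
  = \Sigma_{u0}^{-1}\,\mathrm{cov}(u_t)\,\Sigma_{u0}^{-1}
  = \Sigma_{u0}^{-1}\Sigma_{u0}\Sigma_{u0}^{-1}
  = \Sigma_{u0}^{-1}.
```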



Assumption 3.12. For any fixed t <T , 
_l 

/NT 



(i) 7?w ET=i Eii fsUuPa = O p (l), 



Ei=i Ei=i E,=i - Eu is u js )f3 jt = Op(l) 

^ Ef=i Er =1 « - = op(i) 

Ei^,(<j)6St, ELiC^i* - Eu is u js )£ip jt = o p (l). 
(ii) For each k <r, 

JprVN Eil EjLi ELiK^s - EUisU^Xoi^tikPu = o p (l), 
NTy/N ^i^j,{i,j)£Su Ea=l E«=l( M is M «s ~ Eu is Ui s )\ 0j \' 0l t,ikf3jt = O p (l). 

Theorem 3.6. Under the assumptions of Theorem 3.5, we have for each fixed $t \le T$,

$$\|\hat f_t^{(1)} - f_t\| = O_p(m_N\omega_T^{1-q}(\log T)^{1/r_1+1/r_2}),$$

where $r_1, r_2 > 0$ are defined in Assumption 3.2.

If in addition Assumptions 3.11, 3.12 are satisfied and $\sqrt{N}m_N\omega_T^{2-2q} = o(1)$, then when
$T^{1/(2-2q)} \ll N \ll T^{2-2q}$,

$$\sqrt{N}(\hat f_t^{(1)} - f_t) \to^d N(0, Q^{-1}).$$

Remark 3.4. 1. It follows from Theorem 3.6 that for each fixed $t$, $\hat f_t^{(1)}$ is a root-$N$
consistent estimator of $f_t$. Root-$N$ consistency for the estimated common factors also
holds for the principal components estimator as in Bai (2003). In addition, the above
limiting distribution holds only when $N = o(T^2)$.
2. If we strengthen the assumption to $\max_{t\le T}\|\frac{1}{\sqrt{N}}\sum_{i=1}^N\xi_iu_{it}\| = O_p(\log T)$, then the
uniform rate of convergence can be achieved:

$$\max_{t\le T}\|\hat f_t^{(1)} - f_t\| = O_p(m_N\omega_T^{1-q}(\log T)^{1/r_1+1/r_2+1}).$$

To compare this rate with that of the PCA estimator, we consider, for simplicity, the
strictly sparse case $q = 0$. Then when $N = o(T^{3/2})$ and $m_N$ is either bounded or
growing slowly ($m_N^2 < \min\{\sqrt{T}, T^{3/2}/N\}$), the above rate is faster than that of the
PCA estimator. (The above rate is $O_p((\log T)^{1/r_1+1/r_2}/\sqrt{N})$ when $N = O(T)$, whereas
the uniform convergence rate for the PCA estimator is $O_p(T^{1/4}/\sqrt{N})$.)



4 Joint Estimation 

4.1 $\ell_1$-penalized maximum likelihood

One can also jointly estimate $(\Lambda_0, \Sigma_{u0})$ to take into account the cross-sectional dependence
and heteroskedasticity simultaneously. As in the sparse covariance estimation literature (e.g.,
Lam and Fan 2009, Bien and Tibshirani 2011), we penalize the off-diagonal elements of
the error covariance estimator, and minimize the following weighted-$\ell_1$ penalized objective
function, motivated by a penalized Gaussian likelihood function:

$$(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)}) = \arg\min_{(\Lambda,\Sigma_u)\in\Theta_\Lambda\times\Gamma} L_2(\Lambda, \Sigma_u)$$

$$= \arg\min_{(\Lambda,\Sigma_u)\in\Theta_\Lambda\times\Gamma} \frac{1}{N}\log|\det(\Lambda\Lambda' + \Sigma_u)| + \frac{1}{N}\mathrm{tr}(S_y(\Lambda\Lambda' + \Sigma_u)^{-1}) + \frac{1}{N}\mu_T\sum_{i\neq j}w_{ij}|\Sigma_{u,ij}| \quad (4.1)$$

where $\Gamma$ is the parameter space for $\Sigma_u$, to be defined later. We introduce the weighted
$\ell_1$-penalty $N^{-1}\mu_T\sum_{i\neq j}w_{ij}|\Sigma_{u,ij}|$ with $w_{ij} \ge 0$ to penalize the inclusion of many off-diagonal
elements $\Sigma_{u,ij}$ of small magnitude, which therefore produces a sparse estimator $\hat\Sigma_u^{(2)}$.
Here $\mu_T$ is a tuning parameter that converges to zero at a not-too-fast rate; $w_{ij}$ is an entry-
dependent weight parameter, which can be either deterministic or stochastic. Popular choices
of $w_{ij}$ in the literature include:

Lasso: The choice $w_{ij} = 1$ for all $i \neq j$ gives the well-known Lasso penalty $N^{-1}\mu_T\sum_{i\neq j}|\Sigma_{u,ij}|$
studied by Tibshirani (1996). The Lasso penalty puts an equal weight on each element
of the idiosyncratic covariance matrix.

Adaptive-Lasso: Let $\Sigma^*_{u,ij}$ be a preliminary consistent estimator of $\Sigma_{u0,ij}$. Let $w_{ij} =
|\Sigma^*_{u,ij}|^{-\gamma}$ for some $\gamma > 0$; then

$$N^{-1}\mu_T\sum_{i\neq j}w_{ij}|\Sigma_{u,ij}| = N^{-1}\mu_T\sum_{i\neq j}|\Sigma^*_{u,ij}|^{-\gamma}|\Sigma_{u,ij}|$$

corresponds to the adaptive-lasso penalty proposed by Zou (2006). Note that the
adaptive-lasso puts an entry-adaptive weight on each off-diagonal element of $\Sigma_u$, whose
reciprocal is proportional to a power of the preliminary estimate. If the true element $\Sigma_{u0,ij} \in S_L$,
the weight $|\Sigma^*_{u,ij}|^{-\gamma}$ should be quite large, and results in a heavy penalty on that entry.
The preliminary estimator $\Sigma^*_{u,ij}$ can be taken, for example, as the PCA estimator
$\hat\Sigma^{PCA}_{u,ij} = T^{-1}\sum_{t=1}^T\hat u_{it}^{PCA}\hat u_{jt}^{PCA}$. It was shown by Bai (2003) that under mild conditions,
$\hat\Sigma^{PCA}_{u,ij} - \Sigma_{u0,ij} = O_p(N^{-1/2} + T^{-1/2})$.

SCAD: Fan and Li (2001) proposed to use, for some $a > 2$ (e.g., $a = 3.7$),

$$w_{ij} = I(|\Sigma^*_{u,ij}| \le \mu_T) + \frac{(a - |\Sigma^*_{u,ij}|/\mu_T)_+}{a - 1}I(|\Sigma^*_{u,ij}| > \mu_T).$$

The notation $z_+$ stands for the positive part of $z$: $z_+$ is $z$ if $z > 0$, zero otherwise.
Here $\Sigma^*_{u,ij}$ is still a preliminary consistent estimator, which can be taken as the PCA
estimator.
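The three weight choices above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names `penalty_weights` and `weighted_l1_penalty`, the argument names, and the convention of zeroing the diagonal (only off-diagonal entries are penalized) are assumptions made here. The optional `delta_T` shift is the small additive constant sometimes added to the adaptive-lasso weight to avoid dividing by zero.

```python
import numpy as np

def penalty_weights(sigma_star, mu_T, kind="lasso", gamma=1.0, a=3.7, delta_T=0.0):
    """Entry-wise weights w_ij for the weighted l1 penalty (hypothetical helper).

    sigma_star : preliminary estimate of the error covariance (N x N)
    mu_T       : tuning parameter (used by the SCAD weights)
    """
    abs_s = np.abs(np.asarray(sigma_star, dtype=float))
    if kind == "lasso":
        w = np.ones_like(abs_s)
    elif kind == "adaptive":
        # (|Sigma*_ij| + delta_T)^(-gamma); delta_T > 0 guards against zeros
        w = (abs_s + delta_T) ** (-gamma)
    elif kind == "scad":
        # 1 for small entries, (a - |Sigma*_ij| / mu_T)_+ / (a - 1) otherwise
        w = np.where(abs_s <= mu_T, 1.0,
                     np.maximum(a - abs_s / mu_T, 0.0) / (a - 1.0))
    else:
        raise ValueError("kind must be 'lasso', 'adaptive' or 'scad'")
    np.fill_diagonal(w, 0.0)  # diagonal entries are not penalized
    return w

def weighted_l1_penalty(sigma_u, w, mu_T):
    """N^{-1} * mu_T * sum_{i != j} w_ij |Sigma_u,ij| (w has a zero diagonal)."""
    N = sigma_u.shape[0]
    return mu_T / N * np.sum(w * np.abs(sigma_u))
```

Under SCAD, entries whose preliminary estimate exceeds $a\mu_T$ receive weight zero and are left unpenalized, while entries below $\mu_T$ are penalized exactly as under the Lasso.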

4.2 Consistency of the joint estimation 

We assume the parameter space for $\Sigma_{u0}$ to be, for some known sufficiently large $M > 0$,

$$\Gamma = \{\Sigma_u : \|\Sigma_u\|_1 < M, \|\Sigma_u^{-1}\|_1 < M\}.$$

Then $\Sigma_{u0} \in \Gamma$ implies that all the eigenvalues of $\Sigma_{u0}$ are bounded away from both zero and
infinity. There are many examples where both the covariance and its inverse have bounded
row sums. For example, for each $t$, when $\{u_{it}\}_{i=1}^N$ follows a cross-sectional autoregressive
process AR($p$) for some fixed $p$, then the maximum row sum of $\Sigma_{u0}$ is bounded. The inverse
of $\Sigma_{u0}$ is a banded matrix, whose maximum row sum is also bounded.

As before we assume $T^{-1}\sum_{t=1}^Tf_tf_t' = I_r$ and that $\Lambda_0'\Sigma_{u0}^{-1}\Lambda_0$ is diagonal for identification. In
addition, Assumptions 3.2 and 3.3 for the two-step estimation are still needed. Those con-
ditions, such as strong mixing, weak dependence and bounded eigenvalues of $N^{-1}\Lambda_0'\Lambda_0$,
regulate the data generating process, and asymptotically identify the covariance decomposi-
tion (2.1).
The conditions for the partition $\{(i,j) : i,j \le N\} = S_L \cup S_U$ of $\Sigma_{u0}$ are replaced by the
following, which are weaker than those of two-step estimation in Assumption 3.8. Define the
number of off-diagonal large entries:

$$D = \sum_{i\neq j,(i,j)\in S_U} 1. \quad (4.2)$$

Assumption 4.1. There exists a partition $\{(i,j) : i \le N, j \le N\} = S_L \cup S_U$ where $S_U$ and
$S_L$ are disjoint, which satisfies:

(i) $\Sigma_{u0,ii} \in S_U$ for all $i \le N$,

(ii) $D = o(\min\{N\sqrt{T/\log N}, N^2/\log N\})$,

The following assumption is imposed on the penalty parameters. Define the weights 
ratios 

Assumption 4.2. The tuning parameter $\mu_T$ and the weights $\{w_{ij}\}_{i\le N,j\le N}$ satisfy:
(i) 



a T = o v 



T N ( T \ 1/4 IN N 



mm 



logND'yiogNJ V £>' V^logAT 

#r |S u0 ,ul = Op(iV), 

(ii) $\mu_T\max_{(i,j)\in S_L}w_{ij}\sum_{(i,j)\in S_L}|\Sigma_{u0,ij}| = o(\min\{N, N^2/D, N^2/(D\alpha_T)\})$,
$\mu_T\max_{i\neq j,(i,j)\in S_U}w_{ij} = o(\min\{N/D, \sqrt{N/D}, N/(D\alpha_T)\})$,
$\mu_T\min_{(i,j)\in S_L}w_{ij} \gg \sqrt{\log N/T} + (\log N)/N$.

The above assumption is not as complicated as it looks, and is satisfied by many examples.
For instance, the Lasso penalty sets $w_{ij} = 1$ for all $i, j \le N$. Hence $\alpha_T = \beta_T = 1$. Then
condition (i) of Assumption 4.2 follows from Assumption 4.1(ii), which is also satisfied if
$D = O(N)$. Condition (ii) is also straightforward to verify. This immediately implies the
following lemma.

Lemma 4.1 (Lasso). Choose $w_{ij} = 1$ for all $i, j \le N$, $i \neq j$. Suppose in addition $D = O(N)$
and $\log N = o(T)$. Then Assumption 4.2 is satisfied if the tuning parameter $\mu_T = o(1)$ is
such that

log AT log AT 
One of the attractive features of this lemma is that the condition on $\mu_T$ does not depend
on the unknown $\Sigma_{u0}$. We will present the adaptive lasso and SCAD as another two examples
of the weighted-$\ell_1$ penalty in Section 4.3 below; both satisfy the above assumption.

Our main theorem is stated as follows. 



Theorem 4.1. Suppose $\log N = o(T)$. Under Assumptions 3.1, 3.2, 3.3, 4.1 and 4.2, the
penalized ML estimator satisfies: as $T$ and $N \to \infty$,



For each $t \le T$,

$$\|\hat f_t^{(2)} - f_t\| = o_p(1).$$



Remark 4.1. 1. The consistency for $\hat f_t^{(2)}$ can be made uniform in $t \le T$ if the corresponding
condition is strengthened to $\max_{t\le T}\|N^{-1/2}\sum_{i=1}^N\xi_iu_{it}\| = o_p(\sqrt{N})$.



2. To establish consistency in the high dimensional literature, one usually constructs
a neighborhood $U$ of the true parameters $(\Lambda_0, \Sigma_{u0}) \in U$ (e.g., Rothman et al. 2008,
Lam and Fan 2009), and shows that with probability approaching one, $L_2(\Lambda_0, \Sigma_{u0}) >
\sup_{(\Lambda,\Sigma_u)\in\partial U}L_2(\Lambda, \Sigma_u)$. This strategy, however, does not work here due to the technical
difficulty in dealing with the term $(\Lambda\Lambda' + \Sigma_u)$ in the likelihood function, because its
largest $r$ eigenvalues are unbounded and grow at rate $O(N)$ uniformly in the parameter
space. One of the contributions of Theorem 4.1 is to achieve consistency using a
new strategy to deal with the penalized likelihood function, which involves diverging
eigenvalues.

In this paper we only present the consistency for the joint estimation, which is already
technically difficult as one needs to deal with an equilibrium of the first order conditions for
both $(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)})$ simultaneously. Deriving the limiting distributions for the joint estimators
is difficult, and we leave this as a future topic.



4.3 Two examples 

We present two popular choices for the weights as examples: one is adaptive lasso, pro-
posed by Zou (2006), and the other is SCAD by Fan and Li (2001). Both weights depend on
a preliminary consistent estimate of each element of $\Sigma_{u0}$. In the high dimensional approx-
imate factor model, a simple consistent estimate for each element can be obtained by the
principal component analysis (Stock and Watson 1998 and Bai 2003).
To simplify the presentation, we will assume that $D = O(N)$, which controls the number
of off-diagonal large entries of $\Sigma_{u0}$. Moreover, we retain Assumption 3.6:

$$\max\{|\Sigma_{u0,ij}| : (i,j) \in S_L\} < \omega_T < \min\{|\Sigma_{u0,ij}| : \Sigma_{u0,ij} \in S_U\},$$

and recall that $\omega_T = \sqrt{\frac{\log N}{T}} + \frac{1}{\sqrt{N}}$.

Let the initial estimate $\Sigma^*_{u,ij} = R_{ij}$, where $R_{ij}$ is the PCA estimator of $\Sigma_{u0,ij}$ as in Bai
(2003). The adaptive lasso chooses the weights to be, for some constant $\gamma \in (0, 1]$,

$$\text{(Adaptive Lasso)}: \quad w_{ij} = (|\Sigma^*_{u,ij}| + \delta_T)^{-\gamma}, \quad (4.3)$$

where $\delta_T = o(1)$ is a pre-determined nonnegative sequence. The additive $\delta_T$ was not included
in the original definition of adaptive lasso in Zou (2006), but has often been seen in recent
literature, e.g., Xue and Zou (2012). We include it here in the weights to prevent $w_{ij}$ from getting
too large if $|\Sigma^*_{u,ij}|$ is very close to zero. The adaptive lasso has been used extensively in the
high dimensional literature; see, for example, Huang, Ma and Zhang (2006), van de Geer,
Buhlmann and Zhou (2011), Caner and Fan (2011), etc.

Another important example is SCAD, defined as: for some $a > 2$,

$$\text{(SCAD)}: \quad w_{ij} = I(|\Sigma^*_{u,ij}| \le \mu_T) + \frac{(a - |\Sigma^*_{u,ij}|/\mu_T)_+}{a - 1}I(|\Sigma^*_{u,ij}| > \mu_T). \quad (4.4)$$



We have the following theorem. 



Theorem 4.2. Suppose either the Adaptive Lasso or SCAD is used for the weighted-$\ell_1$ pe-
nalized objective function. Also, suppose $\log N = o(T)$, $D = O(N)$, $\sum_{(i,j)\in S_L}|\Sigma_{u0,ij}| = o(N)$,
and Assumptions 3.2, 3.3, 3.1, 4.1 hold. In addition, assume the tuning parameters are such
that:

(i) for Adaptive Lasso,

$$\Big(\frac{\sum_{(i,j)\in S_L}|\Sigma_{u0,ij}|}{N}\Big)^{1/\gamma} \ll \delta_T \ll \omega_T, \quad (4.5)$$

$$\omega_T^{\gamma+1} \ll \mu_T \ll \omega_T^{\gamma}; \quad (4.6)$$

(ii) for SCAD:

$$\Big(\frac{\log N}{T}\Big)^{1/4} + \Big(\frac{\log N}{N}\Big)^{1/2} \ll \mu_T \ll \min_{i\neq j,(i,j)\in S_U}|\Sigma_{u0,ij}|. \quad (4.7)$$

Then Assumption 4.2 is satisfied, and

$$\frac{1}{N}\|\hat\Sigma_u^{(2)} - \Sigma_{u0}\|_F^2 = o_p(1), \quad \frac{1}{N}\|\hat\Lambda^{(2)} - \Lambda_0\|_F^2 = o_p(1),$$

$$\|\hat f_t^{(2)} - f_t\| = o_p(1).$$

As in the case of Lemma 4.1, an attractive feature of this theorem is that, if both the
upper bound of $\sum_{(i,j)\in S_L}|\Sigma_{u0,ij}|$ and the lower bound of
$\min_{i\neq j,(i,j)\in S_U}|\Sigma_{u0,ij}|$ are known [e.g., in the strictly sparse model,
$\sum_{(i,j)\in S_L}|\Sigma_{u0,ij}| = 0$, and assume $\min_{(i,j)\in S_U}|\Sigma_{u0,ij}|$ is bounded away from zero as in
MA(1)], then Conditions (4.5)-(4.7) do not depend on any other unknown feature of $\Sigma_{u0}$.

5 Numerical Examples 

We propose a novel algorithm to numerically minimize the objective function $L_2(\Lambda, \Sigma_u)$
in (4.1) for joint estimation, which combines the EM algorithm with the majorize-minimize
method recently proposed by Bien and Tibshirani (2011). The algorithm uses the PCA estimators as
initial values, and updates the estimator iteratively. At each iteration, an EM step is
carried out to estimate $\Lambda$ and the empirical residual covariance $\frac{1}{T}\sum_{t=1}^T\hat u_t\hat u_t'$. Then a majorize-
minimize method (Bien and Tibshirani 2011) is used to obtain a positive definite estimate of
the covariance $\Sigma_u$ based on $\frac{1}{T}\sum_{t=1}^T\hat u_t\hat u_t'$ and soft-thresholding. The algorithm is summarized
as follows (see Bai and Li (2012) and Bien and Tibshirani (2011) for detailed descriptions of
the algorithm).

1. Initialize $\Lambda$ and $u$ as the PCA estimators. Initialize $\Sigma_u$ as a diagonal matrix of the
sample covariance based on the PCA residuals.

2. At step $k+1$, $\Lambda_{k+1} = AM^{-1}$, where

$$M = \Lambda_k'\hat\Sigma_{y,k}^{-1}S_y\hat\Sigma_{y,k}^{-1}\Lambda_k + I_r - \Lambda_k'\hat\Sigma_{y,k}^{-1}\Lambda_k,$$

$$A = S_y\hat\Sigma_{y,k}^{-1}\Lambda_k, \quad \hat\Sigma_{y,k} = \Lambda_k\Lambda_k' + \Sigma_{u,k}.$$

Let $S_{u,k} = S_y - A\Lambda_{k+1}' - \Lambda_{k+1}A' + \Lambda_{k+1}M\Lambda_{k+1}'$.

3. Still at step $k+1$, for some small value $t > 0$, let $B = \Sigma_{u,k} - t(\Sigma_{u,k}^{-1} - \Sigma_{u,k}^{-1}S_{u,k}\Sigma_{u,k}^{-1})$.
Let

$$\Sigma_{u,k+1} = S(B, \lambda tK),$$

where $S(A, B)_{ij} = \mathrm{sign}(A_{ij})(|A_{ij}| - B_{ij})_+$ and $K$ is a matrix whose off-diagonal $K_{ij}$ is
$|(S_{u,k})_{ij}|^{-\gamma}$ and whose diagonal elements are zero.

4. Repeat steps 2-3 until convergence.
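Steps 2 and 3 can be sketched as two small NumPy functions. This is an illustrative reading of the updates above and not the authors' implementation: the function names are made up here, the soft-thresholding operator is taken to be the standard $\mathrm{sign}(a)(|a| - b)_+$, and the sketch does not enforce positive definiteness of the updated covariance (in practice the step size $t$ would have to be chosen with that in mind).

```python
import numpy as np

def soft_threshold(A, B):
    """S(A, B)_ij = sign(A_ij) * max(|A_ij| - B_ij, 0), applied entry by entry."""
    return np.sign(A) * np.maximum(np.abs(A) - B, 0.0)

def em_loading_step(Lam_k, Sigma_uk, S_y):
    """Step 2: update the loadings and form the residual covariance S_{u,k}."""
    r = Lam_k.shape[1]
    Sigma_y = Lam_k @ Lam_k.T + Sigma_uk
    Sy_inv_Lam = np.linalg.solve(Sigma_y, Lam_k)          # Sigma_y^{-1} Lam_k
    A = S_y @ Sy_inv_Lam
    M = Sy_inv_Lam.T @ S_y @ Sy_inv_Lam + np.eye(r) - Lam_k.T @ Sy_inv_Lam
    Lam_next = A @ np.linalg.inv(M)
    S_u = S_y - A @ Lam_next.T - Lam_next @ A.T + Lam_next @ M @ Lam_next.T
    return Lam_next, S_u

def mm_covariance_step(Sigma_uk, S_uk, t, lam, gamma):
    """Step 3: one gradient step followed by weighted soft-thresholding."""
    Sigma_inv = np.linalg.inv(Sigma_uk)
    B = Sigma_uk - t * (Sigma_inv - Sigma_inv @ S_uk @ Sigma_inv)
    with np.errstate(divide="ignore"):
        K = np.abs(S_uk) ** (-gamma)   # adaptive weights |(S_{u,k})_ij|^(-gamma)
    np.fill_diagonal(K, 0.0)           # the diagonal is not thresholded
    return soft_threshold(B, lam * t * K)
```

A quick sanity check on the sketch: when $S_y = \Lambda\Lambda' + \Sigma_u$ holds exactly, `em_loading_step` started at $(\Lambda, \Sigma_u)$ returns $\Lambda$ and $\Sigma_u$ unchanged, since then $M = I_r$ and $A = \Lambda$.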

We present a numerical experiment to illustrate the performance of the proposed method.
The data were generated as follows: $\{e_{it}\}_{i\le N,t\le T}$ are both serially and cross-sectionally
independent as $N(0, 1)$. Let

$$u_{1t} = e_{1t}, \quad u_{2t} = e_{2t} + a_1e_{1t}, \quad u_{3t} = e_{3t} + a_2e_{2t} + b_1e_{1t},$$

$$u_{i+1,t} = e_{i+1,t} + a_ie_{it} + b_{i-1}e_{i-1,t} + c_{i-2}e_{i-2,t},$$

where $\{a_i, b_i, c_i\}_{i=1}^N$ are i.i.d. $N(0, 0.7^2)$. Let the two factors $\{f_{1t}, f_{2t}\}$ be i.i.d. $N(0, 1)$, and
$\{\lambda_{i,1}, \lambda_{i,2}\}_{i\le N}$ be uniform on $[0, 1]$. Then $\Sigma_{u0}$ is a banded matrix.
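The design above can be simulated as follows. This is a sketch under stated assumptions: the function name `simulate_panel`, the seed handling, and the choice to return the implied $\Sigma_{u0} = CC'$ (conditional on the drawn coefficients, where $u_t = Ce_t$) are all illustrative and not taken from the paper.

```python
import numpy as np

def simulate_panel(N, T, seed=0):
    """Draw one panel Y (N x T) from the banded-error two-factor design."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal((N, T))            # e_it ~ N(0, 1), independent
    a = rng.normal(0.0, 0.7, N)                # a_i, b_i, c_i ~ N(0, 0.7^2)
    b = rng.normal(0.0, 0.7, N)
    c = rng.normal(0.0, 0.7, N)
    # u_t = C e_t with C lower triangular and bandwidth 3 (0-based row k is unit k+1)
    C = np.eye(N)
    for k in range(1, N):
        C[k, k - 1] = a[k - 1]
        if k >= 2:
            C[k, k - 2] = b[k - 2]
        if k >= 3:
            C[k, k - 3] = c[k - 3]
    u = C @ e
    Sigma_u0 = C @ C.T                         # banded: zero when |i - j| > 3
    F = rng.standard_normal((T, 2))            # two factors, i.i.d. N(0, 1)
    Lam = rng.uniform(0.0, 1.0, (N, 2))        # loadings uniform on [0, 1]
    Y = Lam @ F.T + u
    return Y, Lam, F, Sigma_u0
```

Because $C$ has three sub-diagonals, $CC'$ vanishes whenever $|i - j| > 3$, which is the banded structure referred to in the text.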

We apply the adaptive lasso penalty for our joint estimation, with various choices of the
tuning parameters $\gamma$ and $\mu_T$. The result is compared with the PCA estimator and the regular
maximum likelihood restricted to diagonal $\Sigma_u$ (DML, Bai and Li 2012). More specifically,
DML estimates $(\Lambda_0, \Sigma_{u0})$ by:

$$\min_{\Sigma_{u,ij}=0 \text{ for } i\neq j}\ \min_{\Lambda}\ \frac{1}{N}\log|\Lambda\Lambda' + \Sigma_u| + \frac{1}{N}\mathrm{tr}(S_y(\Lambda\Lambda' + \Sigma_u)^{-1}). \quad (5.1)$$

Therefore DML forces the covariance estimator to be diagonal even though the true $\Sigma_{u0}$ is
not. Hence it does not take the idiosyncratic cross-sectional dependence into account.

For each estimator, the smallest canonical correlation (the higher the better) between
the estimator and the parameter has been used to assess the accuracy of
each estimator. Tables 1 and 2 list the results of the estimated factor loadings and common
factors from joint estimation.
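The accuracy measure can be computed with a QR decomposition followed by an SVD. The sketch below is one standard way to obtain canonical correlations between two sets of columns; whether the columns were centered in the reported experiments is not stated, so the centering here is an assumption, and the function names are illustrative.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the (centered) columns of X and Y.

    X and Y have the same number of rows, e.g. an N x r estimated loading
    matrix and the N x r true loading matrix. Returned in decreasing order.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)        # orthonormal basis of col(Xc)
    Qy, _ = np.linalg.qr(Yc)        # orthonormal basis of col(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)     # guard against rounding slightly above 1

def smallest_canonical_correlation(X, Y):
    """The summary statistic reported in the tables (higher is better)."""
    return canonical_correlations(X, Y).min()
```

The measure is invariant to invertible linear transformations of either argument, which matters here because factors and loadings are only identified up to rotation.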

We have also computed the canonical correlations between the estimators and the true 
parameters using the regularized two-step method (Section 3) with iterations. For compu- 
tational simplicity, the threshold value in the first step has been fixed to be the adaptive 
threshold of Fan et al. (2012) with a universal constant C = 1, which we find to maintain the 
finite-sample positive definiteness well. The results demonstrate that both two-step and joint 
estimations have higher canonical correlations, and thus outperform the PCA and DML. 

Our EM plus majorize-minimize algorithm maximizes an approximate penalized like-
lihood function. Developing an efficient algorithm for maximizing the original likelihood
function will be a future research direction.
Table 1: Canonical correlations between $\hat\Lambda^{(2)}$ and $\Lambda_0$

                                          Penalized ML
                              $\gamma = 1$                    $\gamma = 5$
  T     N     PCA     DML     $\mu_T = 0.08$  $\mu_T = 0.3$   $\mu_T = 0.08$  $\mu_T = 0.3$
  50    50    0.205   0.199   0.212           0.222           0.230           0.234
  50    100   0.429   0.558   0.591           0.613           0.627           0.631
  50    150   0.328   0.470   0.494           0.495           0.515           0.507
  100   50    0.496   0.519   0.560           0.537           0.558           0.537
  100   100   0.394   0.574   0.621           0.648           0.648           0.658
  100   150   0.774   0.819   0.837           0.829           0.840           0.836

Penalized ML uses the one-step adaptive Lasso estimation.



Table 2: Canonical correlations between $\hat F^{(2)}$ and $F$

                                          Penalized ML
                              $\gamma = 1$                    $\gamma = 5$
  T     N     PCA     DML     $\mu_T = 0.08$  $\mu_T = 0.3$   $\mu_T = 0.08$  $\mu_T = 0.3$
  50    50    0.232   0.234   0.251           0.267           0.279           0.283
  50    100   0.477   0.640   0.671           0.732           0.748           0.749
  50    150   0.411   0.599   0.623           0.638           0.666           0.650
  100   50    0.430   0.446   0.503           0.473           0.508           0.474
  100   100   0.371   0.579   0.647           0.688           0.687           0.697
  100   150   0.820   0.867   0.880           0.892           0.912           0.903

Canonical correlations are presented. Penalized ML uses the one-step adaptive Lasso estimation.
Table 3: Canonical correlations between the regularized two-step ML estimators (Section 3)
and the true parameters

                     Factor loadings                 Factors
  T     N     PCA     DML     Two-step ML     PCA     DML     Two-step ML
  50    50    0.205   0.199   0.241           0.232   0.234   0.277
  50    100   0.429   0.558   0.643           0.477   0.640   0.752
  50    150   0.328   0.470   0.565           0.411   0.599   0.731
  100   50    0.496   0.519   0.548           0.430   0.446   0.469
  100   100   0.394   0.574   0.717           0.371   0.579   0.758
  100   150   0.774   0.819   0.846           0.820   0.867   0.927

The SCAD($\tau_{ij}$) threshold has been used for the covariance estimation, where $\tau_{ij} = \alpha_{ij}\omega_T$ with the
adaptive threshold constant $\alpha_{ij}$ proposed by Cai and Liu (2011).

6 Conclusion 

We study the estimation of a high dimensional approximate factor model in the presence 
of cross sectional dependence and heteroskedasticity. The classical PCA method does not 
efficiently estimate the factor loadings or common factors because it essentially treats the 
idiosyncratic error to be homoskedastic and cross sectionally uncorrelated. For the efficient 
estimation it is essential to estimate a large error covariance matrix. 

We assume the model to be conditionally sparse in the sense that after the common 
factors are taken out, the idiosyncratic components have a sparse covariance matrix. This 
enables us to combine the merits of both sparsity and high dimensional factor analysis. 
Two maximum-likelihood-based approaches are proposed to estimate the common factors
and factor loadings; both involve regularizing a large sparse covariance matrix. Extensive
asymptotic analysis has been carried out. In particular, we develop the inferential theory
for the two-step estimation.

It remains to derive the limiting distribution as well as the optimal rates of convergence
for the estimators by the joint-estimation method. This will extend the consistency results
obtained in the current paper. In the presence of a covariance $\Lambda\Lambda'$ that has fast-diverging
eigenvalues, the task is difficult because it requires the consistency of the penalized covariance
estimator under the operator norm. We intend to address this issue in future research.
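A Proofs for generic estimators

We need to establish the results for two sets of estimators: the two-step estimator and
the joint estimator, whose proofs for consistency share some similarities. Therefore in this
section we establish some preliminary results for generic estimators that can be used for
both cases. We denote by $(\hat\Lambda, \hat\Sigma_u)$ a generic estimator for $(\Lambda_0, \Sigma_{u0})$, which can be either
$(\hat\Lambda^{(1)}, \hat\Sigma_u^{(1)})$ or $(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)})$. Define

$$Q_2(\Lambda, \Sigma_u) = \frac{1}{N}\mathrm{tr}(\Lambda_0'\Sigma_u^{-1}\Lambda_0 - \Lambda_0'\Sigma_u^{-1}\Lambda(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}\Lambda_0), \quad (A.1)$$

$$Q_3(\Lambda, \Sigma_u) = \frac{1}{N}\log|\Lambda\Lambda' + \Sigma_u| + \frac{1}{N}\mathrm{tr}(S_y(\Lambda\Lambda' + \Sigma_u)^{-1}) - \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}) - \frac{1}{N}\log|\Sigma_u| - Q_2(\Lambda, \Sigma_u). \quad (A.2)$$

Define the set

$$\Xi_\delta = \{(\Lambda, \Sigma_u) : \delta^{-1} < \lambda_{\min}(N^{-1}\Lambda'\Lambda) \le \lambda_{\max}(N^{-1}\Lambda'\Lambda) < \delta,\ \delta^{-1} < \lambda_{\min}(\Sigma_u) \le \lambda_{\max}(\Sigma_u) < \delta\}.$$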

We first present a lemma that will be needed throughout the proof. 

Lemma A.1. (i) $\max_{i,j\le r}|\frac{1}{T}\sum_{t=1}^Tf_{it}f_{jt} - Ef_{it}f_{jt}| = O_p(\sqrt{1/T})$.
(ii) $\max_{i,j\le N}|\frac{1}{T}\sum_{t=1}^Tu_{it}u_{jt} - Eu_{it}u_{jt}| = O_p(\sqrt{(\log N)/T})$.
(iii) $\max_{i\le r,j\le N}|\frac{1}{T}\sum_{t=1}^Tf_{it}u_{jt}| = O_p(\sqrt{(\log N)/T})$.

Proof. See Lemmas A.3 and B.1 in Fan, Liao and Mincheva (2011). □
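Lemma A.2. Under Assumption 3.2, for any $\delta > 0$,

$$\sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}|Q_3(\Lambda, \Sigma_u)| = O_p\Big(\frac{\log N}{N} + \sqrt{\frac{\log N}{T}}\Big).$$

Therefore we can write

$$\frac{1}{N}\log|\Lambda\Lambda' + \Sigma_u| + \frac{1}{N}\mathrm{tr}(S_y(\Lambda\Lambda' + \Sigma_u)^{-1})
= \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}) + \frac{1}{N}\log|\Sigma_u| + Q_2(\Lambda, \Sigma_u) + O_p\Big(\frac{\log N}{N} + \sqrt{\frac{\log N}{T}}\Big). \quad (A.3)$$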

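Proof. First of all, note that $|\Lambda\Lambda' + \Sigma_u| = |\Sigma_u| \times |I_r + \Lambda'\Sigma_u^{-1}\Lambda|$, and
$\sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}\frac{1}{N}\log|I_r + \Lambda'\Sigma_u^{-1}\Lambda| = O\big(\frac{\log N}{N}\big)$, hence we have

$$\frac{1}{N}\log|\Lambda\Lambda' + \Sigma_u| = \frac{1}{N}\log|\Sigma_u| + O\Big(\frac{\log N}{N}\Big) \quad (A.4)$$

where $O(\cdot)$ is uniform in $\Xi_\delta$. Equation (A.4) will be used later in the proof.

We now consider the term $N^{-1}\mathrm{tr}(S_y(\Lambda\Lambda' + \Sigma_u)^{-1})$. With the identification condition
$\frac{1}{T}\sum_{t=1}^Tf_tf_t' = I_r$, $\bar f = 0$, and $S_u = \frac{1}{T}\sum_{t=1}^Tu_tu_t'$,

$$S_y = \frac{1}{T}\sum_{t=1}^T(y_t - \bar y)(y_t - \bar y)' = \Lambda_0\Lambda_0' + S_u + \Lambda_0\frac{1}{T}\sum_{t=1}^Tf_tu_t' + \Big(\Lambda_0\frac{1}{T}\sum_{t=1}^Tf_tu_t'\Big)' - \bar u\bar u'.$$

By the matrix inversion formula $(\Lambda\Lambda' + \Sigma_u)^{-1} = \Sigma_u^{-1} - \Sigma_u^{-1}\Lambda(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}$,

$$\frac{1}{N}\mathrm{tr}(S_y(\Lambda\Lambda' + \Sigma_u)^{-1}) = \frac{1}{N}\mathrm{tr}(\Lambda_0'\Sigma_u^{-1}\Lambda_0) + \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}) - A_1 + A_2 + A_3 - A_4 - A_5, \quad (A.5)$$

where $A_1 = N^{-1}\mathrm{tr}(\Lambda_0\Lambda_0'\Sigma_u^{-1}\Lambda(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1})$, $A_2 = \frac{1}{N}\mathrm{tr}(\frac{1}{T}\sum_{t=1}^T\Lambda_0f_tu_t'(\Lambda\Lambda' + \Sigma_u)^{-1})$,
$A_3 = \frac{1}{N}\mathrm{tr}(\frac{1}{T}\sum_{t=1}^Tu_tf_t'\Lambda_0'(\Lambda\Lambda' + \Sigma_u)^{-1})$, and $A_4 = \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}\Lambda(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1})$. Term
$A_5 = N^{-1}\mathrm{tr}(\bar u\bar u'(\Lambda\Lambda' + \Sigma_u)^{-1}) = O_p((\log N)/T)$ uniformly in the parameter space, and hence
can be ignored.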
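The two matrix identities used above, the determinant factorization and the matrix inversion formula, are easy to verify numerically. The sketch below uses randomly generated inputs; the function name and the particular way of generating a positive definite $\Sigma_u$ are illustrative assumptions.

```python
import numpy as np

def check_factor_identities(N=30, r=2, seed=0):
    """Return the maximal numerical error of the two identities on random inputs."""
    rng = np.random.default_rng(seed)
    Lam = rng.standard_normal((N, r))
    G = rng.standard_normal((N, N))
    Sigma_u = G @ G.T / N + np.eye(N)              # symmetric positive definite
    Sigma_inv = np.linalg.inv(Sigma_u)
    small = np.eye(r) + Lam.T @ Sigma_inv @ Lam    # r x r matrix I_r + Lam' Sigma^{-1} Lam
    full = Lam @ Lam.T + Sigma_u

    # (Lam Lam' + Sigma)^{-1} = Sigma^{-1} - Sigma^{-1} Lam (I_r + Lam' Sigma^{-1} Lam)^{-1} Lam' Sigma^{-1}
    inv_direct = np.linalg.inv(full)
    inv_formula = Sigma_inv - Sigma_inv @ Lam @ np.linalg.inv(small) @ Lam.T @ Sigma_inv
    err_inverse = np.abs(inv_direct - inv_formula).max()

    # log|Lam Lam' + Sigma| = log|Sigma| + log|I_r + Lam' Sigma^{-1} Lam|
    logdet_full = np.linalg.slogdet(full)[1]
    logdet_split = np.linalg.slogdet(Sigma_u)[1] + np.linalg.slogdet(small)[1]
    err_logdet = abs(logdet_full - logdet_split)
    return err_inverse, err_logdet
```

Both identities reduce an $N \times N$ inverse or determinant to an $r \times r$ one, which is what makes the $O(\log N / N)$ bound in (A.4) visible: the extra log-determinant involves only $r$ eigenvalues, each of order $N$.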

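Let us look at terms $A_1$, $A_2$, $A_3$ and $A_4$ subsequently. Note that $\lambda_{\max}(\Sigma_u)$ and $N\lambda_{\min}^{-1}(\Lambda'\Lambda)$
are both bounded from above uniformly in $\Xi_\delta$; we have

$$\sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}\lambda_{\max}[(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}] \le \sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}\frac{\lambda_{\max}(\Sigma_u)}{\lambda_{\min}(\Lambda'\Lambda)} = O(N^{-1}), \quad (A.6)$$

$$\sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}\lambda_{\max}[(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}] \le \sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}\lambda_{\max}[(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}] = O(N^{-1}). \quad (A.7)$$

In addition, $\|\Lambda\|_F = O(\sqrt{N})$, $\lambda_{\max}(\Sigma_u^{-1}) = O(1)$ uniformly in $\Xi_\delta$, and $\|\Lambda_0\|_F = O(\sqrt{N})$.
Applying the matrix inversion formula yields

$$A_1 = \frac{1}{N}\mathrm{tr}(\Lambda_0'\Sigma_u^{-1}\Lambda(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}\Lambda_0) - \frac{1}{N}\mathrm{tr}(\Lambda_0'\Sigma_u^{-1}\Lambda(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}\Lambda_0)$$

$$= \frac{1}{N}\mathrm{tr}(\Lambda_0'\Sigma_u^{-1}\Lambda(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}\Lambda_0) + O\Big(\frac{1}{N}\Big), \quad (A.8)$$

where $O(\cdot)$ is uniform over $(\Lambda, \Sigma_u) \in \Xi_\delta$. In the second equality above we applied (A.6) and
(A.7) and the following inequality:

$$\frac{1}{N}\mathrm{tr}(\Lambda_0'\Sigma_u^{-1}\Lambda(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}\Lambda_0)$$

$$\le \frac{1}{N}\|\Lambda_0'\Sigma_u^{-1}\Lambda\|_F^2\lambda_{\max}[(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}]\lambda_{\max}[(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}]$$

$$\le O(N^{-3})\|\Lambda_0\|_F^2\|\Lambda\|_F^2\lambda_{\max}^2(\Sigma_u^{-1}) = O(N^{-1}).$$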

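By Lemma A.1(iii), and $\lambda_{\max}((\Lambda\Lambda' + \Sigma_u)^{-1}) \le \lambda_{\max}(\Sigma_u^{-1}) = O(1)$ uniformly in $\Xi_\delta$,

$$\sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}|A_2| \le \frac{1}{N}\|\Lambda_0'(\Lambda\Lambda' + \Sigma_u)^{-1}\|_F\Big\|\frac{1}{T}\sum_{t=1}^Tf_tu_t'\Big\|_F = O_p\Big(\sqrt{\frac{\log N}{T}}\Big). \quad (A.9)$$

Similarly, $\sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}|A_3| = O_p(\sqrt{\frac{\log N}{T}})$. Again by the matrix inversion formula,

$$A_4 = \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}\Lambda(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}) - \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}\Lambda(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}).$$

The second term on the right hand side is of smaller order (uniformly) than the first term,
because it has an additional term $(I_r + \Lambda'\Sigma_u^{-1}\Lambda)^{-1}$, whose maximum eigenvalue is $O(N^{-1})$
uniformly by (A.7). The first term is bounded by (uniformly in $\Xi_\delta$):

$$\frac{C}{N}\|S_u\Sigma_u^{-1}\Lambda\|_F \cdot O(N^{-1})\|\Lambda'\Sigma_u^{-1}\|_F \le O(N^{-1})\lambda_{\max}(S_u) = O_p\Big(\sqrt{\frac{\log N}{T}} + \frac{1}{N}\Big).$$

Hence $\sup_{(\Lambda,\Sigma_u)\in\Xi_\delta}|A_4| = O_p(T^{-1/2}(\log N)^{1/2} + N^{-1})$. Results (A.4) and (A.5) then yield

$$\frac{1}{N}\log|\Lambda\Lambda' + \Sigma_u| + \frac{1}{N}\mathrm{tr}(S_y(\Lambda\Lambda' + \Sigma_u)^{-1})$$

$$= \frac{1}{N}\mathrm{tr}(\Lambda_0'\Sigma_u^{-1}\Lambda_0) + \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}) + \frac{1}{N}\log|\Sigma_u| - \frac{1}{N}\mathrm{tr}(\Lambda_0'\Sigma_u^{-1}\Lambda(\Lambda'\Sigma_u^{-1}\Lambda)^{-1}\Lambda'\Sigma_u^{-1}\Lambda_0) + O_p\Big(\frac{\log N}{N} + \sqrt{\frac{\log N}{T}}\Big)$$

$$= \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}) + \frac{1}{N}\log|\Sigma_u| + Q_2(\Lambda, \Sigma_u) + O_p\Big(\frac{\log N}{N} + \sqrt{\frac{\log N}{T}}\Big).$$

□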

Throughout the proofs, we note that the consistency depends crucially on the consistency
of the following quantity:

$$J = (\hat\Lambda - \Lambda_0)'\hat\Sigma_u^{-1}\hat\Lambda(\hat\Lambda'\hat\Sigma_u^{-1}\hat\Lambda)^{-1}.$$

We state the following lemma for the generic estimators.
Lemma A.3. (i) $\Lambda_0'\Sigma_{u0}^{-1}\Lambda_0 - (I_r - J)\hat\Lambda'\hat\Sigma_u^{-1}\hat\Lambda(I_r - J)' = o_p(N)$.
(ii) First order condition: $\hat\Lambda'(\hat\Lambda\hat\Lambda' + \hat\Sigma_u)^{-1}(S_y - \hat\Lambda\hat\Lambda' - \hat\Sigma_u) = 0$.

We will prove Lemma A.3 for both $(\hat\Lambda^{(1)}, \hat\Sigma_u^{(1)})$ and $(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)})$ later when we deal with
these two estimators individually.

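Lemma A.4. Suppose Lemma A.3 holds, then

(i) $\hat\Lambda'\hat\Sigma_u^{-1}(S_y - \hat\Lambda\hat\Lambda' - \hat\Sigma_u) = 0$.

(ii) $(J - I_r)'(J - I_r) - I_r = O_p(N^{-1} + T^{-1/2}(\log N)^{1/2})$.

Proof. (i) Using the matrix inverse formula, the same argument as in (A.2) of Bai and Li (2012)
implies $\hat\Lambda'(\hat\Lambda\hat\Lambda' + \hat\Sigma_u)^{-1} = (I_r + \hat\Lambda'\hat\Sigma_u^{-1}\hat\Lambda)^{-1}\hat\Lambda'\hat\Sigma_u^{-1}$. Thus part (i) follows from the first order
condition in Lemma A.3.

(ii) Let $H = (\hat\Lambda'\hat\Sigma_u^{-1}\hat\Lambda)^{-1}$. Part (i) can be equivalently written as $J + J' - J'J + K = 0$,
where

$$K = J'\frac{1}{T}\sum_{t=1}^Tf_tu_t'\hat\Sigma_u^{-1}\hat\Lambda H + H\hat\Lambda'\hat\Sigma_u^{-1}\frac{1}{T}\sum_{t=1}^Tu_tf_t'J - \frac{1}{T}\sum_{t=1}^Tf_tu_t'\hat\Sigma_u^{-1}\hat\Lambda H - H\hat\Lambda'\hat\Sigma_u^{-1}\frac{1}{T}\sum_{t=1}^Tu_tf_t'$$

$$- H\hat\Lambda'\hat\Sigma_u^{-1}(S_u - \hat\Sigma_u)\hat\Sigma_u^{-1}\hat\Lambda H.$$

Note that for $(\hat\Lambda, \hat\Sigma_u) \in \Xi_\delta$, $H = O_p(N^{-1})$, $J = O_p(1)$ for each element, $\|\hat\Sigma_u^{-1}\| = O_p(1)$,
$\|\hat\Lambda\|_F = O_p(\sqrt{N})$, hence by Lemma A.1(iii) each of the first four terms of $K$ is $O_p(\sqrt{(\log N)/T})$.

Moreover, for the empirical covariance, $\|S_u\|^2 \le 2\sum_{i,j\le N}(T^{-1}\sum_{t=1}^Tu_{it}u_{jt} - \Sigma_{u0,ij})^2 +
2\|\Sigma_{u0}\|^2 = O_p(T^{-1}N^2\log N + 1)$ by Lemma A.1, which implies $H\hat\Lambda'\hat\Sigma_u^{-1}S_u\hat\Sigma_u^{-1}\hat\Lambda H =
O_p(N^{-1} + T^{-1/2}(\log N)^{1/2})$. Also, $H\hat\Lambda'\hat\Sigma_u^{-1}\hat\Sigma_u\hat\Sigma_u^{-1}\hat\Lambda H = H = O_p(N^{-1})$. Therefore $K =
O_p(N^{-1} + T^{-1/2}(\log N)^{1/2})$. It then implies (ii).

□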

Lemma A.5. Suppose Lemma A.3 holds, then $J = o_p(1)$.

Proof. By our assumption, both $\hat\Lambda'\hat\Sigma_u^{-1}\hat\Lambda$ and $\Lambda_0'\Sigma_{u0}^{-1}\Lambda_0$ are diagonal. Moreover, the eigen-
values of $N^{-1}\hat\Lambda'\hat\Sigma_u^{-1}\hat\Lambda$ and $N^{-1}\Lambda_0'\Sigma_{u0}^{-1}\Lambda_0$ are bounded away from zero. Therefore by Lemma
A.3(i) and Lemma A.4(ii), there are two diagonal matrices $M_1$ and $M_2$ whose eigenvalues
are all bounded away from zero, such that

$$(I_r - J)M_1(I_r - J)' = M_2 + o_p(1), \quad (J - I_r)'(J - I_r) = I_r + o_p(1). \quad (A.10)$$

Applying Lemma A.1 of Bai and Li (2012), we have $J = o_p(1)$ and $M_1 = M_2 + o_p(1)$. We
also assumed $\hat\Lambda$ and $\Lambda_0$ have the same column signs, as a part of the identification condition.

□



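B Proofs for Section 3

In this section, $(\hat\Lambda, \hat\Sigma_u) = (\hat\Lambda^{(1)}, \hat\Sigma_u^{(1)})$ and $J = (\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}(\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)})^{-1}$.
Throughout Appendix B, we will let $H = (\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)})^{-1}$. For notational simplicity,
we let

$$\omega_T = \frac{1}{\sqrt{N}} + \sqrt{\frac{\log N}{T}}.$$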
We first cite a result from Fan et al. (2012):

Theorem B.1 (Theorem 3.1 in Fan et al. (2012)). Suppose $(\log N)^{6/\gamma} = o(T)$ and $\sqrt{T} =
o(N)$, then under Assumptions 3.1-3.5,

$$\|\hat\Sigma_u^{(1)} - \Sigma_{u0}\| = O_p(m_N\omega_T^{1-q}) = \|(\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1}\|.$$

Proof. The sufficient conditions of this theorem are satisfied by our assumptions. See Fan
et al. (2012). □

We then prove Lemma A.3, which then enables us to apply Lemmas A.4 and A.5. Under
Assumptions 3.1-3.3, there is $\delta > 0$ such that $(\Lambda_0, \Sigma_{u0}) \in \Xi_\delta$ and $(\hat\Lambda^{(1)}, \hat\Sigma_u^{(1)}) \in \Xi_\delta$ with
probability approaching one for $\Xi_\delta$ in Appendix A.

Lemma B.1. For $(\hat\Lambda, \hat\Sigma_u) = (\hat\Lambda^{(1)}, \hat\Sigma_u^{(1)})$, Lemma A.3 is satisfied.

Proof. The first order condition with respect to $\Lambda$ in (ii) is easy to verify, which is the
same as that in Bai and Li (2012). We only show part (i).

By definition, $L_1(\hat\Lambda^{(1)}) \le L_1(\Lambda_0)$. Also the representation defined in Lemma A.2 yields

$$Q_3(\Lambda, \hat\Sigma_u^{(1)}) + Q_2(\Lambda, \hat\Sigma_u^{(1)}) = L_1(\Lambda) - N^{-1}\mathrm{tr}(S_u(\hat\Sigma_u^{(1)})^{-1}) - N^{-1}\log|\hat\Sigma_u^{(1)}|.$$

Thus

$$Q_2(\hat\Lambda^{(1)}, \hat\Sigma_u^{(1)}) + Q_3(\hat\Lambda^{(1)}, \hat\Sigma_u^{(1)}) \le Q_2(\Lambda_0, \hat\Sigma_u^{(1)}) + Q_3(\Lambda_0, \hat\Sigma_u^{(1)}).$$

Note that $Q_2$ is always nonnegative and $Q_2(\Lambda_0, \hat\Sigma_u^{(1)}) = 0$. Therefore by Lemma A.2, $0 \le
Q_2(\hat\Lambda^{(1)}, \hat\Sigma_u^{(1)}) = o_p(1)$. Moreover, the matrix in the trace operation of $Q_2$ is positive semi-
definite, hence

$$\frac{1}{N}[\Lambda_0'(\hat\Sigma_u^{(1)})^{-1}\Lambda_0 - (I_r - J)\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}(I_r - J)'] = o_p(1). \quad (B.1)$$

It remains to show that $N^{-1}\Lambda_0'((\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1})\Lambda_0 = o_p(1)$, which follows immediately from
Theorem B.1 and that $m_N\omega_T^{1-q} = o(1)$.

□

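B.1 Proof of Theorem 3.1
B.1.1 Consistency for $\hat\Lambda^{(1)}$

The equality (B.1) implies

$$\frac{1}{N}(\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}(\hat\Lambda^{(1)} - \Lambda_0) - \frac{1}{N}J\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}J' = o_p(1).$$

The second term is bounded by $N^{-1}\|J\|_F^2\|\hat\Lambda^{(1)}\|_F^2\|(\hat\Sigma_u^{(1)})^{-1}\| = O_p(\|J\|_F^2)$. Lemma A.5 then
implies the second term is $o_p(1)$, which then implies that the first term is $o_p(1)$. Because
$(\hat\Sigma_u^{(1)})^{-1}$ has eigenvalues bounded away from zero asymptotically, we have $N^{-1}\|\hat\Lambda^{(1)} - \Lambda_0\|_F^2 =
o_p(1)$.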

B.1.2 Consistency for $\hat\lambda_j^{(1)}$

Lemma A.4(i) can be equivalently written as: for any $j \le N$,

A« - X oj = -J'X oj + jyAW'CEW)-^ (b.2) 
where E^] denotes the jth column of E« , and aj is an iV x 1 vector 



a i 



t=i t=i t=i 

The consistency of $\max_{j\le N}\|\hat\lambda_j^{(1)} - \lambda_{0j}\|$ follows from Lemma A.5 and the following Lemma
B.2.

Lemma B.2. $\max_{j\le N}\|H\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}a_j\| = O_p(m_NN^{-1/2}\omega_T^{1-q} + T^{-1/2}(\log N)^{1/2})$.
Proof. By Lemma A.1, uniformly in $j \le N$,



T 



t=i 



Finally, maxj<jv ||iTA^(Eu = O p (\ogN/T). The result then follows from a trian- 

gular inequality and that rriNUJ^f q — o(l). □ 

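B.2 Proof of Theorem 3.2

B.2.1 Uniform rate for $\hat\lambda_j^{(1)}$

By (B.2), the uniform rate of convergence follows from Lemma B.2 and the following
Lemma B.3.

Lemma B.3. $J = O_p(m_N\omega_T^{1-q})$.

Proof. The first order condition in Lemma A.4(i) is equivalent to:

$$J'J - J' - J + H\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}B(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}H = 0 \quad (B.3)$$

where $B = \Lambda_0T^{-1}\sum_{t=1}^Tf_tu_t' + (\Lambda_0T^{-1}\sum_{t=1}^Tf_tu_t')' + S_u - \hat\Sigma_u^{(1)} - \bar u\bar u'$. We have $\|\Lambda_0\|_F = O(\sqrt{N})$,
$\bar u\bar u' = O_p(N\log N/T)$, and $\|S_u - \hat\Sigma_u^{(1)}\| \le \|\hat\Sigma_u^{(1)} - \Sigma_{u0}\| + \|S_u - \Sigma_{u0}\| = O_p(NT^{-1/2}(\log N)^{1/2} +
m_N\omega_T^{1-q})$. Therefore $H\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}B(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}H = O_p(T^{-1/2}(\log N)^{1/2} + m_NN^{-1}\omega_T^{1-q})$.
Since $J = o_p(1)$, $J'J$ can be ignored. It follows from (B.3) that

$$J' + J = O_p\Big(\sqrt{\frac{\log N}{T}} + \frac{m_N\omega_T^{1-q}}{N}\Big). \quad (B.4)$$

Let $J_{ij}$ denote the $(i,j)$th entry of $J$. It then follows that $J_{ii} = O_p(T^{-1/2}(\log N)^{1/2} +
m_NN^{-1}\omega_T^{1-q})$ for all $i \le r$. It is also not hard to verify that $\sqrt{(\log N)/T} = O(m_N\omega_T^{1-q})$ for
any $0 \le q < 1$ since $m_N \ge 1$.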

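On the other hand, due to the identification condition, both $\Lambda_0'\Sigma_{u0}^{-1}\Lambda_0$ and
$\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}$ are diagonal. Let $\mathrm{ndg}(M)$ denote the off-diagonal elements of $M$. Then
$\mathrm{ndg}(\Lambda_0'\Sigma_{u0}^{-1}\Lambda_0) = \mathrm{ndg}(\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}) = 0$ is equivalent to

$$\mathrm{ndg}\{(\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)} + \hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}(\hat\Lambda^{(1)} - \Lambda_0)\}$$

$$= \mathrm{ndg}\{-\Lambda_0'((\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1})\Lambda_0 + (\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}(\hat\Lambda^{(1)} - \Lambda_0)\}.$$

Note that if $\mathrm{ndg}\{M_1\} = \mathrm{ndg}\{M_2\}$ then $\mathrm{ndg}\{HM_1H\} = \mathrm{ndg}\{HM_2H\}$ for two matrices $M_1$
and $M_2$ since $H$ is diagonal. Also, $(\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)}H = J$. The above identification
condition implies

$$\mathrm{ndg}\{HJ + J'H\} = \mathrm{ndg}\{-H\Lambda_0'((\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1})\Lambda_0H + H(\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}(\hat\Lambda^{(1)} - \Lambda_0)H\}. \quad (B.5)$$

Note that $H\Lambda_0'((\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1})\Lambda_0H = O_p(m_NN^{-1}\omega_T^{1-q})$. Let $h_{ii}$ denote the $i$th diagonal
entry of $H$. Let $X = (\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}(\hat\Lambda^{(1)} - \Lambda_0)$. Then for $i \neq j$, (B.4) and (B.5) imply
that

$$J_{ij} + J_{ji} = O_p\Big(\sqrt{\frac{\log N}{T}} + \frac{m_N\omega_T^{1-q}}{N}\Big), \quad h_{ii}J_{ij} + h_{jj}J_{ji} = O_p\Big(\frac{m_N\omega_T^{1-q}}{N}\Big) + h_{ii}h_{jj}X_{ij}.$$

By assumption, with probability one, there is $\delta > 0$ such that $(N\delta)^{-1} < h_{ii} < N^{-1}\delta$, and
$h_{ii} \neq h_{jj}$ for $i \neq j$. Moreover, since all the eigenvalues of $\hat\Sigma_u^{(1)}$ are bounded away from zero
and infinity, wpa1, $\|\hat\Lambda^{(1)} - \Lambda_0\|_F^2 \ge c\|X\|_F$ for some $c > 0$. Then the above two equations imply
that for any $i \neq j$, $J_{ij} = O_p(m_N\omega_T^{1-q}) + O_p(N^{-1})X_{ji}$ (since $\sqrt{\log N/T} = O(m_N\omega_T^{1-q})$). Then

$$\|J\|_F = O_p\Big(m_N\omega_T^{1-q} + \frac{1}{N}\|X\|_F\Big). \quad (B.6)$$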



Moreover, by Lemma IB.21 maxj<7v 

We now show that $J = O_p(m_N\omega_T^{1-q})$. Suppose this does not hold, then (B.6) implies
$J = O_p(N^{-1}X)$. By the definition

$$X = (\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}(\hat\Lambda^{(1)} - \Lambda_0),$$

$\|X\|_F = O_p(\|\hat\Lambda^{(1)} - \Lambda_0\|_F^2)$. Therefore $J = O_p(N^{-1}X)$ yields $\|J\|_F^2 = O_p(N^{-2}\|\hat\Lambda^{(1)} - \Lambda_0\|_F^4)$.
The first order condition (B.2) also yields

$$\max_{j\le N}\|\hat\lambda_j^{(1)} - \lambda_{0j}\|^2 = O_p(\|J\|_F^2) = O_p(N^{-2}\|\hat\Lambda^{(1)} - \Lambda_0\|_F^4),$$

which implies $\|\hat\Lambda^{(1)} - \Lambda_0\|_F^2 = \sum_{j=1}^N\|\hat\lambda_j^{(1)} - \lambda_{0j}\|^2 = O_p(N^{-1}\|\hat\Lambda^{(1)} - \Lambda_0\|_F^4)$. Therefore

$$\frac{1}{N^{-1}\|\hat\Lambda^{(1)} - \Lambda_0\|_F^2} = \frac{\|\hat\Lambda^{(1)} - \Lambda_0\|_F^2}{N^{-1}\|\hat\Lambda^{(1)} - \Lambda_0\|_F^4} = O_p(1),$$

which contradicts the consistency $N^{-1}\|\hat\Lambda^{(1)} - \Lambda_0\|_F^2 = o_p(1)$. This concludes the proof.

□

Therefore, (B.2) gives $\max_{j\le N}\|\hat\lambda_j^{(1)} - \lambda_{0j}\| = O_p(\|J\|_F) = O_p(m_N\omega_T^{1-q})$. The rate
of convergence for $N^{-1/2}\|\hat\Lambda^{(1)} - \Lambda_0\|_F$ then follows immediately since it is bounded by
$\max_{j\le N}\|\hat\lambda_j^{(1)} - \lambda_{0j}\|$.


B.3 Proof of Theorem 3.3 

By the definition of the covariance estimator in the first step, $\hat\Sigma_u^{(1)} = (s_{ij}(R_{ij}))_{N \times N}$, where $s_{ij}$ is a chosen thresholding function. It was shown by Fan et al. (2012, Theorem 2.1) that $R_{ij}$ is the PCA estimator of $T^{-1}\sum_{t=1}^T u_{it}u_{jt}$, that is, $R_{ij} = T^{-1}\sum_{t=1}^T \hat u_{it}^{PCA}\hat u_{jt}^{PCA}$.

Lemma B.4. For any $\epsilon > 0$ and any constant $M > 0$, for all large enough $N, T$,

$$P(|R_{ij}| > M\tau_{ij}, \forall (i,j) \in S_U) > 1 - \epsilon.$$

Proof. We have $|R_{ij}| \ge |\Sigma_{u0,ij}| - |\Sigma_{u0,ij} - R_{ij}|$. Thus for all large enough $N, T$,

$$P(|R_{ij}| > M\tau_{ij}, \forall (i,j) \in S_U) \ge P(|\Sigma_{u0,ij}| > M\tau_{ij} + |\Sigma_{u0,ij} - R_{ij}|, \forall (i,j) \in S_U)$$

$$\ge P(|\Sigma_{u0,ij}|/2 > |\Sigma_{u0,ij} - R_{ij}|, \forall (i,j) \in S_U) > 1 - \epsilon,$$

where in the second and last inequalities we used the assumption that $\omega_T = o(\min_{(i,j) \in S_U}|\Sigma_{u0,ij}|)$ and the fact that $\max_{ij}|\Sigma_{u0,ij} - R_{ij}| = O_p(\omega_T)$.

□



Proof of Theorem 3.3

By Fan et al. (2012), $\max_{ij}|R_{ij} - \Sigma_{u0,ij}| = O_p(\omega_T)$, which implies that for any $\epsilon > 0$ there is $C > 0$ such that $P(\max_{ij}|R_{ij} - \Sigma_{u0,ij}| > C\omega_T) < \epsilon/2$. For some universal $M > 0$, we set the threshold $\tau_{ij} = M\alpha_{ij}\omega_T$ at entry $(i,j)$, where $\alpha_{ij}$ is a data-dependent value that satisfies the following: for any $\epsilon > 0$, there is $C_1 > 0$ such that $P(\alpha_{ij} > C_1, \forall i \ne j) > 1 - \epsilon/2$. Then as long as the constant $M$ in the definition of the threshold is larger than $2C/C_1$,

$$P(\max_{i,j}|R_{ij} - \Sigma_{u0,ij}| > \min_{ij}\tau_{ij}/2) \le P(\max_{i,j}|R_{ij} - \Sigma_{u0,ij}| > MC_1\omega_T/2) + \epsilon/2 < \epsilon.$$
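The entry-dependent thresholding used in this proof can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes soft thresholding as the rule $s_{ij}$ and the choice $\alpha_{ij} = \sqrt{R_{ii}R_{jj}}$, both of which are illustrative assumptions, and the names `soft_threshold_cov`, `entry_thresholds`, `M`, and `omega_T` are hypothetical. It checks the two properties the argument relies on: $s_{ij}(z) = 0$ whenever $|z| \le \tau_{ij}$, and $|s_{ij}(z) - z| \le \tau_{ij}$.

```python
import numpy as np

def soft_threshold_cov(R, tau):
    """Apply s_ij(z) = sign(z) * max(|z| - tau_ij, 0) to the off-diagonal entries of R."""
    R = np.asarray(R, dtype=float)
    tau = np.asarray(tau, dtype=float)
    S = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)
    # The diagonal (variances) is left unthresholded.
    S[np.diag_indices_from(S)] = np.diag(R)
    return S

def entry_thresholds(R, M, omega_T):
    """tau_ij = M * alpha_ij * omega_T, with the assumed choice alpha_ij = sqrt(R_ii * R_jj)."""
    d = np.sqrt(np.diag(R))
    return M * np.outer(d, d) * omega_T

# Toy data: residuals with no cross-sectional correlation (assumed for illustration).
rng = np.random.default_rng(0)
N, T = 30, 200
U = rng.standard_normal((N, T))
R = U @ U.T / T                                   # sample covariance of the residuals
omega_T = 1.0 / np.sqrt(N) + np.sqrt(np.log(N) / T)
tau = entry_thresholds(R, M=0.5, omega_T=omega_T)
Sigma_hat = soft_threshold_cov(R, tau)
```

Hard thresholding or SCAD could be substituted for the soft rule; the proof only uses the two properties above.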
Note also that if $s_{ij}(R_{ij}) \ne 0$, then $|R_{ij}| > \tau_{ij}$, by the definition of $s_{ij}$. This implies

$$P(\hat\Sigma_{u,ij}^{(1)} \ne 0, \exists (i,j) \in S_L) \le P(|R_{ij}| > \tau_{ij}, \exists (i,j) \in S_L) \le P(\max_{(i,j) \in S_L}|R_{ij}| > \min_{ij}\tau_{ij}).$$



Since max (ii)eSL |S u0 ,ij 




$$P(\hat\Sigma_{u,ij}^{(1)} \ne 0, \exists (i,j) \in S_L) \le P(\max_{ij}|R_{ij} - \Sigma_{u0,ij}| > \min_{ij}\tau_{ij}/2) < \epsilon.$$



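On the other hand, for arbitrarily small $\epsilon > 0$, $P(\max_{ij}\tau_{ij} < K\omega_T) > 1 - \epsilon/2$ for some $K > 0$, which implies $P(|R_{ij}| > M\omega_T + K\omega_T, \forall (i,j) \in S_U) \le P(|R_{ij}| > M\omega_T + \tau_{ij}, \forall (i,j) \in S_U) + \epsilon/2$. By the definition of $s_{ij}$, $|s_{ij}(z) - z| \le \tau_{ij}$ for all $z \in \mathbb{R}$. Therefore $|R_{ij} - \hat\Sigma_{u,ij}^{(1)}| = |R_{ij} - s_{ij}(R_{ij})| \le \tau_{ij}$; hence for arbitrarily large $M > 0$,

$$P(|\hat\Sigma_{u,ij}^{(1)}| > M\omega_T, \forall (i,j) \in S_U) \ge P(|R_{ij}| > M\omega_T + |R_{ij} - \hat\Sigma_{u,ij}^{(1)}|, \forall (i,j) \in S_U)$$

$$\ge P(|R_{ij}| > M\omega_T + \tau_{ij}, \forall (i,j) \in S_U) \ge P(|R_{ij}| > (M + K)\omega_T, \forall (i,j) \in S_U) - \epsilon/2 \ge 1 - \epsilon,$$

where the last inequality follows from Lemma B.4.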

B.4 Proof of Theorems 3.4 and 3.5 
B.4.1 Proof of Theorem 3.4 

A simple derivation implies that $\|N^{-1}\Lambda_0'((\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1})\Lambda_0\|_F \le N^{-1}\|\Lambda_0\|_F^2\|(\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1}\| = O_p(m_N\omega_T^{1-q})$. This rate is not tight enough for the $\sqrt{T}$-consistency and the limiting distribution. A more refined rate of $N^{-1}\Lambda_0'((\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1})\Lambda_0$ depends on the convergence properties of the PCA estimator. We begin by citing some results proved by Fan et al. (2012). Recall that $R_{ij}$ denotes the $(i,j)$th entry of the orthogonal complement covariance in the sample covariance's spectral decomposition, and $\hat\Sigma_{u,ij}^{(1)} = s_{ij}(R_{ij})$.

Let $\{\hat u_{it}\}_{i \le N, t \le T}$ be the PCA estimates of $\{u_{it}\}_{i \le N, t \le T}$. Let $\hat\lambda_j^{PCA}$ and $\hat f_t^{PCA}$ denote the PCA estimators of the factor loadings and factors.

Lemma B.5. (i) For any $i, j$, with probability one $R_{ij} = T^{-1}\sum_{t=1}^T \hat u_{it}\hat u_{jt}$.

(ii) $\max_{i \le N} T^{-1}\sum_{t=1}^T(\hat u_{it} - u_{it})^2 = O_p(\omega_T^2)$.

(iii) There is a nonsingular matrix $H$ such that $T^{-1}\sum_{t=1}^T\|\hat f_t^{PCA} - Hf_t\|^2 = O_p(T^{-1} + N^{-1})$ and $\max_j\|\hat\lambda_j^{PCA} - H'^{-1}\lambda_{0j}\| = O_p(\omega_T)$.

(iv) $\max_{i,j \le N}|R_{ij} - \Sigma_{u0,ij}| = O_p(\omega_T)$.

Proof. See Theorem 2.1 and Lemma C.11 of Fan et al. (2012). □
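Lemma B.6. $\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T u_{it}\lambda_{0i}'H^{-1}(\hat f_t^{PCA} - Hf_t)\xi_i\xi_i' = O_p(\frac{1}{\sqrt{NT}} + \frac{1}{T} + \frac{1}{N})$.

Proof. By Bai (2003), there are two $r \times r$ matrices $H$ and $V$, with $\|V\|_F = O_p(1)$ and $\|H\|_F = O_p(1)$, such that

$$\hat f_t^{PCA} - Hf_t = V(NT)^{-1}\sum_{s=1}^T\Big[\hat f_s^{PCA}u_s'u_t + \hat f_s^{PCA}f_s'\sum_{j=1}^N\lambda_{0j}u_{jt} + \hat f_s^{PCA}f_t'\sum_{j=1}^N\lambda_{0j}u_{js}\Big].$$

The desired result then follows from Lemma B.7 below.

□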

Lemma B.7. (i) ^ Ef =1 Z?=i u it \' m H-\NT)^ EL /? CA <^£ = O p (^ + 1 + £) 
W Eli Eii ^A'o^-^iVT)- 1 ELi ? CA /; Ef=i Aoi^' = P (^ + i) 

(M) m Eli Ef=i ^^(jvt)- 1 ELi /? ca A' Ef=i V*,-^' = o p {^ + 1). 

Proof, (i) We have, 

T TV T T iV T 

n aTt E E E ^v^'^ii < ii ^ E E «* E /^x^'a^ii 

t=l i=l s=l t=l i=l s=l 

T N T 

+ 11 ^ E E «« E^' - /^X^" 1 ' = a + b. (B.7) 

t=l 1=1 s=l 

We bound a, 6 separately. Here a is upper bounded by a\ + a 2 , where by Cauchy-Schwarz, 

^ T N T 

t=i i=l s =l 

T T T 

< max ||Aq^II(^ E^) 1/2 H^(^ E 4E - ^)f) 1/2 

t=l t=l 8=1 

T T 

< O p (l)(-^\\ — ^f s (u' s u t -Eu' s uM 2 ) 1/2 - (B.8) 

t=l s=l 

Note that E± Ef=i II A ELi /-K«t " Eu' s u t )f = E\\± EL /»* - £<OI| 2 , which 
is 0(T _1 A^ _1 ) by Assumption 3.9. Hence a x = O p ((A^T)- 1 / 2 ). 

^T/VT 1^ 1 T 

° 2 = II^^EE^E^'^^^^Ao^C'll < max-^|w it |0(l)^:^||/ SJ E;<w t || 

t=l i=l s =l ~ t=l s=l 

(B.9) 

Since maxKT^T-^^ELill/^^ll) < 0{T~ l ) max f Ej =1 \Eu' s u t \/N = 0(T- 1 ) by 
the strong mixing condition (Lemma C.5 of Fan, Liao and Mincheva 2012), we have $a_2 = O_p(T^{-1})$. This implies $a = O_p(N^{-1/2}T^{-1/2} + T^{-1})$.
Now we bound $b$. Using the Cauchy-Schwarz inequality, we have $b \le b_1 + b_2$, where

T N T 

t=l 1=1 3=1 

T N / T \ / T \ 



< 0,(1)^ y. E w ? E ii/? c " - »/>n 2 ? E w«. - £ < 

t=l i=l \ s=l / \ s=l 



Ut 



2 



< o p( i)o p (-^ + -^nm = o p( i + -j=), (b.io) 

where the second inequality follows from ET~ l EjJ =1 |tigixt — £ , i4'Ut| 2 = O(N). Using Cauchy- 
Schwarz inequality, we also obtain 

T N T 



2 ii j\r 2 T 2 

(=1 i=\ s =i 



< o P (i)(± e - ^f) i/2 e \Eu>u t / N A 

s=l \ s=l J 



1/2 



= O p (— = + h. (B.ll) 

(ii) Let di t ki be the (/c, /)th element of Then the (k, /)th element of the object of interest 
is bounded by d\ + d 2 , where, by Cauchy Schwarz inequality, 



di = It^EEEX>^-^^ 

V / t=l s=l 7 = 1 1=1 



3 = 

T , T N N T 



< e n/? cA f) i/2 (^ e iii? CA n 2 ) 1/2 ii^ E E - s«««*)^qA«ii 

s=i s=i i=i i=i t=i 

- O^). (B,2, 

The last equality follows from Assumption 3.9. Also, Yl%j<N \^ u it u jt\ = Yl(ij)es v \^uo,ij\ + 
£( iJ)6 sJ E i*«l = <W Thus 



= i mv E E E ^(^iO^-^/jAp,-^*,! 

V / t=l s = l j = \ i=l 
T 

^ OpW^Ell^llll/J E l^ t ^l=O p (-). (B.13) 

s=l i,j<N 
(iii) The object of interest is bounded by $e_1 + e_2$, where

T N T N 

^ = \\jpt2 E E EE^ A i. rl (f ci - //.a)//Ao,-,,c<; = o P (-= + -), (b.u) 

s=l j=l t=l i=l * 

and we used the fact that 1 £f =1 - #/ t || 2 = OpiT' 1 + iV" 1 ) from Lemma and 

that iV- 1 Eti Wt EL /t««|| = Op(r-V3)H 

^ T N T N 

e * = ii^EEEE^^^-^ii = (R15) 
=1 j=i t=i i=i 



□ 



Lemma B.8. For Su in the partition '■ i,j < iV} = Si U SV? 

ft m EL Ei^esv u it X' 0j H-\NT)^ ELi /T^X^ = + * + i 



1 V T TPCAf/sr< N \ „, t ti — n ( l lo % N 



(in) ^ELE^j)^^ = O p {J0*£ + 



log JV N 

T ' 



Proof, (i) The term of interest is bounded by a + b, where 



^ T T 

j?f^ E E Mi * E f,H'u' 8 UtH- v 'XoM'j II 



iV 2 T 2 

*=i (i,j)eSu,i& s=i 

T T 



^2 E E «« E(^' - ./://')<"/// %e«su. 



iV 2 T 2 

*=i (i,j)eSu,i& s=i 

Here a is upper bounded by ai + a 2 , where 

ai = || jv^t ELi E(ij)es p .<#i Ui * ELi f s H'W s u t ~ Eu^H- 1 ' Xoj^W, and 

a 2 = ||]^XiiE(ij)6Str,^j w «ELi/«^ /Sw X-H'~ 1 'Aoi^||- Note tliat fl i and fl 2 can be 

bounded in the same way as (IB.8P and (IB.9H . The only difference is that iV -1 YliLi * s replaced 

by a double sum N^Y^MeSv&j' B ^ the assum P tion > ^ v_1 E(y) 6 5^i 1 = The 
result of the proof is exactly the same, so is omitted. We conclude that a = O p (N^ 1 ^ 2 T^ 1 ^ 2 + 
T- 1 ). 

On the other hand, b < b\ + & 2 where 
61 = Wife* Ef =1 E^)^,^ ELitfJ™' - - ^O^^'Ao^^ll, and 



'We have (JV" 1 £^ II T Et=i /*^*ll) 2 < ^ ££1 II T ELi /^*ll 2 = ^ E 4 =i ^(tELM*) 2 , 
whose expectation is TV -1 Y^,iLi £j=i var (r £t=i fjt u it)- Note that var(y Et=i fjt u it) = 0(T _1 ) uniformly 
in i < iV. 






h = II^Ef=i£( M)e ^ ^E^ Using Cauchy- 

Schwarz inequality and the strong mixing condition, b\ and b 2 can be also bounded in an 
exactly the same way of flRTOjl and (iRlTT) . We conclude that b = O p (N~ l +T" 1 + (NT)- 1/2 ). 

(ii) Let dij^i be the (k, l)th element of Then the (k, l)th element of the object of 
interest is bounded by d\ + d,2, where 

d\ = \ jjfrpJ2t=i Sl=i J2(i,j)eSu,i^j I2v=i( u it u vt - EuitU vt )X' Qj H~ 1 f^ CA f' s \ovdij t ki\, and 
d 2 = l(j^ELi£LiE(ij) 6 ^,^Ef=i(^^^ Bounding d u d 2 

is slightly different from (IB.12[) and (IB. 13|) . and we give the detail here. By Cauchy Schwarz 
inequality 

T T T N 

s=l s=l i^j,(i,j)£Su t=l v=l 

which is O p ((NT)~ l l 2 ) by Assumption 3.9. On the other hand, 
d 2 < O p (N~ 2 ) £ 

&3,(i,3)£S v Efc=i \^uo,ik\- Note that ||S u0 ||i - 0(m N ), where m N is as defined 
in Assumption 3.1. Thus d 2 = O^N^m^)- 

(iii) The object of interest is bounded by e x + e 2 , where 

e l = || jv 2 T 2 Es=l Ei^ij'JeSt/ Et=l Ev=l u it^QjH l (ff CA — H fg) fj.\ 0v U vs £i£'j\\, 
Cl = || JV 2 T 2 Es=l Ei^i^eSu Et=l E«=l w ii^dj/s/t^0« W„ s ^i^- 1| . 

Since maxj<jv IjT^ 1 EtLi ft u it\\ — O p (y/\og N/T), we conclude that e\ = O v 1 ^ log — 



T 



^g), and e 2 = O p (M). □ 

From Lemma B.8 we immediately have the following result.

Lemma B.9. ^ £f =1 E*#,M)es uuK,H-\ff CA - Hf t )^ = O p (u 2 + m N /N). 

Proof. Note that results (i)(ii)(iii) in Lemma TB. 81 sum up to O p [i>j\ + j 'N) . Hence Lemma 
El follows from the equality ff CA - Hf t = V(NT)~ l £j =1 Tr»t + f' s Ef=i *ojU jt + 

The following lemma strengthens the results of Bai (2003) when $\Sigma_{u0}$ is sparse.
Lemma B.10. For the PCA estimator,

(i) n- 1 Y2ti(Rii - s u0 ,n)6C' = o p (u 2 ). 

(H) N- 1 Ei&^eSviRij ~ = O p (uj 2 + m N /N). 

Proof, (i) N- 1 J2li(Rii ~ = Eti(^* ~ S^&S/N + Eti(S u ,u ~ ^mM'JN. 

By Assumption 3.9, EL(Su,u ~ S«o,«)^/^ = EliElMt - £«o,n)&£/ '(NT) = 
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O p (l/V NT). On the other hand, j? Ei=i(-^»» ~ Su,u)£i£'i is equal to 

^ N T N T 2 N T 

-fif E ~ U «)^< = Jff E E^* ~~ "«) 2 ^ + ^ E E ~ U «)&& 

i=l t=l t=l t=l i=l i=l 

The first term on the right hand side is $O_p(\omega_T^2)$. We now work on the second term. By Bai (2003), there is a nonsingular matrix $H$ such that

$$u_{jt} - \hat u_{jt} = \lambda_{0j}'H^{-1}(\hat f_t^{PCA} - Hf_t) + (\hat\lambda_j^{PCA} - H'^{-1}\lambda_{0j})'(\hat f_t^{PCA} - Hf_t) + (\hat\lambda_j^{PCA} - H'^{-1}\lambda_{0j})'Hf_t. \quad (B.16)$$

By LemmaEl^ EL YOLx u lt X' 0i H-\f t PCA - Hf t )^> = O p {^ + ± + j 7 )- In addition, 
for each element of 

NT NT 

E E MX^-H'^X^Hf^M <^E H rf i.«r E max IIAf^-^ll, 

j=i t=i i=i t=i J 



Which is O p {u T yJ^-). AISO, 

NT T N 

j=l t=l t=l i=l 



1/2 



(t t a 1 " \ 

± E H/? CA - Wn** ||Af* - ^'^Aoill 2 ^ Ei4 E IMi,«ll 2 
t=i J *=i i=i / 

(ii) Since Rij = T _1 Et=i UuUj t , the term of interest equals 
2 1 ^ 1 1 ^ 

a? E ^^m%-%)(^ + t7 E ^E^« _M ^^' t_ ^*^4 



iV ^ T ^— ' -""^ iV ^ T 



1 T 

E E^ m *' m ^ — ^-"ii./j)Cs'- 



iVT 

By Assumption 3.9, the third term is O p ((A^T) -1 / 2 ). By the assumption that Ylii^j {ij)es v ^ = 
O(N) and Cauchy Schwarz inequality, the second term is O p {oSj). We now work out the first 
term. Again we use the equality u jt -u jt = X' 0j H- x {ff CA -Hf t ) + (Xj CA -H'- l X 0j )\f[ CA - 
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Hft) + (\f CA - H'- l \ 0j )'Hf t . Lemma ES gives 

i^j,(i,j)&Su t=l 

On the other hand, ^ Eiyy,(i,j)eSu f Ej=i M j*(Af CA - H' 1 ' H is bounded by, 



since max^ ||&|| = 0(1). Also, i E^,;)^ T EL ^(Af^-^^'Ao,)'^-^)^ 
is bounded by 



^Eh/*-^ii 2 fE^ E 



1/2 

12 



which is O p (^ + ^). 



Proof of Theorem 3.4: $N^{-1}\Lambda_0'((\hat\Sigma_u^{(1)})^{-1} - \Sigma_{u0}^{-1})\Lambda_0 = O_p(\omega_T^{2-2q}m_N^2)$.
Proof. By the triangle inequality, the left-hand side is bounded by

The first term is O p {uf^ 2q m 2 N ). We now bound the second term, which is 
1 1 N 1 

i=1 iH,(i,j)£Su 



□ 



where H = AqE~ . The first term on the right hand side is O p (u;f.) by Lemma [B.101 The third 



term is dominated by, Oftf" 1 )^ |£ u0 ,d + E 5i IEJ&I) = O^- 1 ) + O^- 1 ) £ 5l 1^1- 

•N E(ij) e s L 



By Theorem 3.3, for any e > and any M > 0, P(^E (ij)eSi 1^1 > Mu>$) < P(3(i,j) e 
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Sl, ^u lj 7^ 0) — e - This implies the third term is O p (oo^). The second term equals 

By Lemma MM (n), E^,( tj)eS[7 (^ - E «o,ii)&^ = P K + m^/N). On the other 
hand, recall that |sy(z) — ^| < or?- when \z\ > br^ (Section 3.1), 

ii^ E - = n^ E M**) - H >i^j 

+^ E - ^ + 11^ E - ^ 

^i,(i>i)6St/,|-Ry|>&Ty i^i,(i,i)6S'c;,|ii i; ,-|<6Ty 

Write v = ||^" 1 E< ? fej,(ij) 6 Str,|fly|<fr^(%( i2 <i) ~ R ij)^jl then for an y C > 0, and 
e > 0, Lemma EH implies P(u > Mu\) < P(3(i,j) G Su,\Rij\ < br^) < e, which 
yields v = O p (uj^). Therefore jfJ2t^j,(i,j)eSu(^u}j ~ s «o,y)&£j = O p (u%). This implies 
^-^((E^)- 1 - E^)Ao = O p (f4 + u 2 - 2q m 2 N + m N /N) = O p {u 2 f 2q m 2 N ). 

□ 

B.4.2 Convergence rate for J 

We now improve the rate in Lemma B.3.

Lemma B.ll. (i) MW^V^Ao^LK + (^4 E t =i MYlffiY^B = 
O p (m N T- l l\\ogN) l l 2 uj 1 T q ). 

(ii) Hl^% l {S u - W^T^Y^H = O p (m N uj^r q T~ 1 / 2 (log N) 1 / 2 + rriNU^N- 1 ). 
Proof, (i) By Theorem 3.2, 

||A« - AoIIj, = O p {VNm N ^) = WA^'^)- 1 - KKoWf- (B.17) 
Therefore the RHS of part (i) equals 



T T 

Ao^EM + CAo^EM)' 

t=l t=l 



E^ 1 A tf + O p (m N \j ^4~ 9 ) • (B. 18) 



Now it follows from Assumption 3.10 that 



E /*< S «o lA o = 4f E E = °p(t=) = O p (m N J 1 ^ 

t=i t=i i=i V lv ± 



-w T y )> 
which then yields the desired result.

(ii) Recall that \\S U - = Op^T" 1 / 2 (log AO 1/2 + m N u^ q ) and that 
IIAW'^)- 1 - AdE^oH-F = O p (^Nm N uj l T ~ q ). By TheoremEU the RHS of (ii) equals 



By Assumption 3.10 (note that H = O p (A^" 1 )), 

1 ^ 

H K^uo( S u - S u0 )S~ 1 A 0J ff = -H ^2( u it u jt ~ EuaUjt^igjH = O p {—= 



rp" / j / j\~u~Jf «.-Ji./'s*'sJ" ~p\ /NT' 

i<N,j<Nt<T * 



□ 



Lemma B.12. $J = O_p(m_N^2\omega_T^{2-2q})$.

Proof. By (B.3) and Lemma B.11, ignoring the smaller order $J'J$, we have



This implies that $J_{ii} = O_p(m_N\omega_T^{1-q}(T^{-1/2}(\log N)^{1/2} + N^{-1}))$.

Moreover, since $H(\hat\Lambda^{(1)} - \Lambda_0)'(\hat\Sigma_u^{(1)})^{-1}(\hat\Lambda^{(1)} - \Lambda_0)H = O_p(N^{-1}m_N^2\omega_T^{2-2q})$, (B.5) and Theorem 3.4 imply $\mathrm{ndg}\{HJ + J'H\} = O_p(N^{-1}m_N^2\omega_T^{2-2q})$. Therefore for $i \ne j$, $J_{ij} = O_p(m_N^2\omega_T^{2-2q} + m_N\omega_T^{1-q}(T^{-1/2}(\log N)^{1/2} + N^{-1})) = O_p(m_N^2\omega_T^{2-2q})$. The desired result follows immediately. □

B.4.3 Improved rate for $\hat\lambda_j^{(1)}$

Lemma B.13. (i) HA^'iW)- 1 ^ £f=i(^ - sg) = O p (m N u 2 T - q + m 2 N u^ 2q N' 1 / 2 ) . 
(ii) HAWpPy^^UtflXoj = O^m^T- 1 / 2 (log N) 1 / 2 ). 

Proof, (i) We have, \\A^' (W)' 1 - KKo\\f = O p (VNm N u l T q ). Hence H{A^' {f!^)' 1 - 
KKo)T- 1 Y J LMu Jt -^ { u])=O p (m N ^ part 
(i) equals 

T 2 , ,2-2g 



HA ; T,jT 1 ^2(u t u jt - E(u t u jt )) + O p (m N u 2 T q + 



m M uj, 



n uj t 



/AT 



where O p (-) is uniform in j < N. By Assumption 3.10, for each j < N, 

T -.NT 

HA'^T- 1 ^(« (% - E(u t u Jt )) = H— J2J2^ U ^ - E i u ^)) = O p ((NT)- 1 ' 2 ). 



T 

t=l 8=1 t=l 
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(ii) We have (A^gW)- 1 " A^ 1 )^ 1 £f =1 u^'l^ = O^m^T- 1 ' 2 (log N) 1 / 2 ). 
Hence (ii) equals 



t=i 

By Assumption 3.10, the first term equals NH(NT)' 1 J2? =1 ELi CiUitfl^oj = O p ((NT)-^ 2 ), 
which yields the desired result. □ 

Lemma B.14. For each fixed j < N, 

a« - = HA^^r^ /*%« + o P {mW T 2q )- 



t=i 



Proof. Note that the two terms in Lemma B.13 (i) and (ii) are dominated by $O_p(m_N^2\omega_T^{2-2q})$. Therefore, the desired expansion follows from the first order condition (B.2) and Lemma B.12. □

B.4.4 Proof of Theorem 3.5 
By Lemma IRT1 and (jBTTTji 



t=l t=l 

1 /l — iV"" 1 
= TfYl f tU i l + °p( m ^ w r~ 29 + rn N u l T ~ q J ^y~) = — ^2 ft u Jt + O p (m 2 N u 2 T 2q ). 
t=i t=i 

By the assumption that $m_N^2\omega_T^{2-2q} = o(T^{-1/2})$, we have $\sqrt{T}(\hat\lambda_j^{(1)} - \lambda_{0j}) = T^{-1/2}\sum_{t=1}^T f_tu_{jt} + o_p(1)$. The limiting distribution follows since

$$T^{-1/2}\sum_{t=1}^T f_tu_{jt} \to_d N_r(0, E(u_{jt}^2f_tf_t')).$$
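The limiting variance $E(u_{jt}^2 f_t f_t')$ of the leading term can be checked by simulation. The sketch below is only an illustration under assumptions that are not part of the proof: a single factor ($r = 1$), and $f_t$ and $u_{jt}$ drawn as independent i.i.d. Gaussians, so that the target variance reduces to $E(u_{jt}^2)E(f_t^2)$. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, reps = 200, 4000
sigma_u = np.sqrt(2.0)          # assumed standard deviation of u_jt

def scaled_score(T, rng):
    """One draw of T^{-1/2} * sum_t f_t u_jt with r = 1."""
    f = rng.standard_normal(T)              # factor, unit variance
    u = sigma_u * rng.standard_normal(T)    # idiosyncratic error, independent of f
    return np.sum(f * u) / np.sqrt(T)

draws = np.array([scaled_score(T, rng) for _ in range(reps)])

# Under independence, E(u^2 f^2) = E(u^2) * E(f^2) = sigma_u^2 * 1.
target_var = sigma_u ** 2
empirical_var = draws.var()
```

Comparing `empirical_var` with `target_var` gives a rough check; with dependent or heteroskedastic data the product form above no longer applies and $E(u_{jt}^2 f_t f_t')$ would have to be estimated directly.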

B.5 Proof of Theorem 3.6 

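For any $t \le T$, $y_t - \bar y = \Lambda_0f_t + u_t - \bar u$. Hence

$$\hat f_t^{(1)} - f_t = -J'f_t + (\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)})^{-1}\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}(u_t - \bar u). \quad (B.19)$$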
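Expansion (B.19) is consistent with a factor estimator of generalized least squares form, $\hat f_t = (\hat\Lambda'\hat\Sigma_u^{-1}\hat\Lambda)^{-1}\hat\Lambda'\hat\Sigma_u^{-1}(y_t - \bar y)$. The following is a minimal sketch of that computation, assuming this form and assuming that estimates of the loadings and the error covariance are already available; the function name `gls_factors` and the toy inputs are hypothetical.

```python
import numpy as np

def gls_factors(Y, Lam, Sigma_u):
    """GLS-type factor estimates.

    Y       : N x T data matrix (column t is y_t)
    Lam     : N x r loading matrix
    Sigma_u : N x N symmetric positive definite error covariance
    Returns a T x r matrix whose t-th row is f_hat_t.
    """
    Yc = Y - Y.mean(axis=1, keepdims=True)   # y_t - y_bar
    A = np.linalg.solve(Sigma_u, Lam)        # Sigma_u^{-1} Lam, N x r
    G = Lam.T @ A                            # Lam' Sigma_u^{-1} Lam, r x r
    # A.T equals Lam' Sigma_u^{-1} because Sigma_u is symmetric.
    return np.linalg.solve(G, A.T @ Yc).T

# Noise-free toy check: with u_t = 0 and the true loadings,
# the estimator returns the demeaned factors exactly.
rng = np.random.default_rng(3)
N, T, r = 20, 50, 2
Lam = rng.standard_normal((N, r))
F = rng.standard_normal((T, r))
B = rng.standard_normal((N, N))
Sigma_u = B @ B.T / N + np.eye(N)
Y = Lam @ F.T
F_hat = gls_factors(Y, Lam, Sigma_u)
```

Using `np.linalg.solve` rather than forming explicit inverses is a numerical design choice, not something the text prescribes.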
B.5.1 Convergence rate

Since both $f_t$ and $u_t$ have exponential tails, using Bonferroni's method we have $\max_{t \le T}\|f_t\| = O_p((\log T)^{1/r_2})$ and $\max_{t \le T}\|u_t\| = O_p(\sqrt{N}(\log T)^{1/r_1})$. Thus by Lemma B.12, $\max_{t \le T}\|J'f_t\| = O_p(m_N^2\omega_T^{2-2q}(\log T)^{1/r_2})$. The term with $\bar u$ in (B.19) is of smaller order, hence is negligible. Also $\|(\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)})^{-1}\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1} - (\Lambda_0'\Sigma_{u0}^{-1}\Lambda_0)^{-1}\Lambda_0'\Sigma_{u0}^{-1}\| = O_p(N^{-1/2}m_N\omega_T^{1-q})$, where we used $\|\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1} - \Lambda_0'\Sigma_{u0}^{-1}\|_F = O_p(\sqrt{N}m_N\omega_T^{1-q})$. Hence

^(XW'CEi 1 ))-^ 1 ))-^)'^))-!^ - u) = O p (±)A' Q ^u t 
+O p (m 2 N u 2 f 2q (\ogT) 1 ^+m N uj]r q (\ogT)^ 

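Finally, $E(N^{-1}\Lambda_0'\Sigma_{u0}^{-1}u_tu_t'\Sigma_{u0}^{-1}\Lambda_0) = N^{-1}\Lambda_0'\Sigma_{u0}^{-1}\Lambda_0$, whose eigenvalues are bounded; hence $N^{-1/2}\Lambda_0'\Sigma_{u0}^{-1}u_t = O_p(1)$. Also, $O_p(N^{-1/2})$ is of smaller order than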
O p (m N uj l T ~ q (\ogT) l ^ +1 ^ +l ). This implies 

||F - / t || = O p {m N ^{\ogTf/^'^). 

The above proof also shows that the rate can be made uniform if max. t <T \\^^^'o^uo u t\\ = 
O p (logT). 



B.5.2 Asymptotic normality 

Recall that $\xi = \Lambda_0'\Sigma_{u0}^{-1}$ and $\beta_t = \Sigma_{u0}^{-1}u_t$.
Lemma B.15. For any fixed t < T, N~^ 2 (A^ - A ) , S^ 1 m 4 = 0p (l). 
Proof. We expand A^ — A using the first order condition 



T T 

(A« - Ao)' = JA' + ^(Sf))" 1 ^ £ /X + = £>./X + S u - £«] (B.20) 



> rp / j J S 1 rp 

S=l 8=1 



and investigate each term separately. First of all, since J = O p (m 2 N Lij^ 2q ), and by assumption 
that A' E~QU t = J2iLi& u it = O p (VN), we have N~ l/2 JA' Q T^u t = O p (m 2 N u^T 2q ). Second, 
by the assumption that (TA)~ 1//2 Y^=i fs u ' s ^uo u t — O p (l), we have 

T 

* s=l * 

Third, A-^ifAW^gW)-!!^^/^'^- 1 ^ = O p (v/logA/T). Moreover, 
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N- l / 2 H(A( 1 Y(z£y i -A' ^)(S u -Z u0 )Z-Zn = O p (m N uir q y/N log N/T) = o p (l). There- 
fore, by the assumption that (NT^/N)~ 1 J2iLi Y^=i ii{ u isu' s — Eu is u' s )f3 t = o p (l), we have, 



±=HAW$P)-\S U - Z u0 )Z-£u t = -LHA' E^(S U - E uQ )E^u t + o p (l) 

- N N T 

= 7Pr7^ H ^2^2^2^ u i' u i'> - E Ui S u js )/3 jt = Op(l). 
T ViV . =1 . =1 s=1 

Finally, N^HA™ (W)' 1 ^ - E u0 )E^u t = O^m^ 9 ). □ 
Lemma B.16. For any #red t < T ; iV- 1 / 2 ^^ 15 ) -1 - Z~o)u t = o p (l) 

Proo/. We note that, A^A^E^)- 1 = jV" 1 / 2 S(Et 1) -E u0 )ft H-Op^m^" 29 )- 
On the other hand, 

1 1 N 1 - 

^7 S (E^ - E u0 )A = -7= / J (-Rii - E u0 ,ii)6A* + ^7 V, - ^u0,ij)t,i(3jt 

The result of the proof is very similar to that of Lemmas IB. 101 and Theorem 13.1, based on 
the expansion (IB.lOp and Theorem 3.3, hence is omitted. □ 

Proof of asymptotic normality 

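We now fix $t$; then Lemma B.12 gives $J'f_t = O_p(m_N^2\omega_T^{2-2q})$. Hence $\sqrt{N}J'f_t$ is negligible, as $\sqrt{N}m_N^2\omega_T^{2-2q} = o(1)$. Moreover, $(\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)})^{-1}\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\bar u$ is of smaller order than $(\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}\hat\Lambda^{(1)})^{-1}\hat\Lambda^{(1)\prime}(\hat\Sigma_u^{(1)})^{-1}u_t$, hence is negligible. Next,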



NiA^iE^m-^nm-^ = VN(A^A )^A' Q E-Ju t 
+O p (iV- 1 /2 )( A(i)' ( g(i) ) -i _ + O p {m N u l T -«) 

where we used (^'(Sl 15 )- 1 ^')" 1 - (AqS^Ao)- 1 = O p (N- 1 m N u]r q ). By Lemmas EH] 
and EH N' 1 ' 2 ^' {Wy l - AqE-q 1 )^ = Op(l). This implies, for each fixed t, 



N(f} 1] -f t ) = s/NiA'^Aoy'A'^ut + O p {VNm 2 N u^ 2,1 + m N u 1 - q ) 
iV(A' S; 1 A )- 1 A' S;o 1 ^ + ^(l)- 
The asymptotic normality then follows from the fact that

1 - 

* i=l 

C Proofs of Section 4 

C.l Proof of Theorem 4.1 

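Define

$$Q_1(\Sigma_u) = \frac{1}{N}\log|\Sigma_u| + \frac{1}{N}\mathrm{tr}(S_u\Sigma_u^{-1}) + \frac{\mu_T}{N}\sum_{i \ne j}w_{ij}|\Sigma_{u,ij}| - \frac{1}{N}\log|\Sigma_{u0}| - \frac{1}{N}\mathrm{tr}(S_u\Sigma_{u0}^{-1}) - \frac{\mu_T}{N}\sum_{i \ne j}w_{ij}|\Sigma_{u0,ij}|.$$

Let $L_c(\Lambda, \Sigma_u) = L_2(\Lambda, \Sigma_u) - N^{-1}\log|\Sigma_{u0}| - N^{-1}\mathrm{tr}(S_u\Sigma_{u0}^{-1}) - N^{-1}\mu_T\sum_{i \ne j}w_{ij}|\Sigma_{u0,ij}|$. Then the minimizer of $L_c$ is the same as that of $L_2$. This implies $L_c(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)}) \le L_c(\Lambda_0, \Sigma_{u0})$. Recall the definitions of $Q_2(\Lambda, \Sigma_u)$ and $Q_3(\Lambda, \Sigma_u)$. Then

$$L_c(\Lambda, \Sigma_u) = Q_1(\Sigma_u) + Q_2(\Lambda, \Sigma_u) + Q_3(\Lambda, \Sigma_u).$$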

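Lemma C.1. There is a nonnegative stochastic sequence $0 \le d_T = O_p(N^{-1}\log N + T^{-1/2}(\log N)^{1/2})$ such that $Q_1(\hat\Sigma_u^{(2)}) \le d_T$ with probability one.

Proof. We have $Q_2(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)}) \ge 0$. In addition, $Q_2(\Lambda_0, \Sigma_{u0}) = Q_1(\Sigma_{u0}) = 0$. Hence

$$Q_1(\hat\Sigma_u^{(2)}) = L_c(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)}) - Q_2(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)}) - Q_3(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)})$$

$$\le L_c(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)}) - Q_3(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)}) \le L_c(\Lambda_0, \Sigma_{u0}) - Q_3(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)})$$

$$= Q_3(\Lambda_0, \Sigma_{u0}) - Q_3(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)}).$$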
By the definition of 0a x T, there is 5 > such that <d\ x T C Eg. The result then holds for 
d T = |Q3(A ,S u0 )| + |g 3 (A( 2 ),Ei 2) )| by Lemma El 

□ 

Throughout, let (recall that $D = \sum_{i \ne j, (i,j) \in S_U}1$)

$$\Delta = (\hat\Sigma_u^{(2)})^{-1} - \Sigma_{u0}^{-1}, \qquad K_T = \sum_{(i,j) \in S_L}|\Sigma_{u0,ij}|.$$
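Lemma C.2. For all large enough $T$ and $N$,

$$NQ_1(\hat\Sigma_u^{(2)}) \ge \frac{1}{2}\mu_T\min_{(i,j) \in S_L}w_{ij}\sum_{(i,j) \in S_L}|\hat\Sigma_{u,ij}^{(2)} - \Sigma_{u0,ij}| + c\|\Delta\|_F^2 - 2\mu_T\max_{(i,j) \in S_L}w_{ij}K_T$$

$$- \Big(O_p\Big(\sqrt{\frac{\log N}{T}}\Big)\sqrt{N + D} + \mu_T\max_{i \ne j, (i,j) \in S_U}w_{ij}\sqrt{D}\Big)\|\Delta\|_F.$$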

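Proof. Let $\Omega_0 = \Sigma_{u0}^{-1}$ and $\hat\Omega = (\hat\Sigma_u^{(2)})^{-1}$. For any $\Sigma_u$, let $\Omega = \Sigma_u^{-1}$. Define a function $f(t) = -\log|\Omega_0 + t\Delta| + \mathrm{tr}(S_u(\Omega_0 + t\Delta))$, $t \ge 0$. Then $-\log|\hat\Omega| + \mathrm{tr}(S_u\hat\Omega) = f(1)$; $-\log|\Omega_0| + \mathrm{tr}(S_u\Omega_0) = f(0)$; and

$$NQ_1(\hat\Sigma_u^{(2)}) = f(1) - f(0) + \mu_T\sum_{i \ne j}w_{ij}|\hat\Sigma_{u,ij}^{(2)}| - \mu_T\sum_{i \ne j}w_{ij}|\Sigma_{u0,ij}|. \quad (C.1)$$

By the integral remainder Taylor expansion, $f(1) - f(0) = f'(0) + \int_0^1(1 - t)f''(t)\,dt$. We now calculate $f'(0)$ and $f''(t)$. Using the matrix differentiation formula, we have $f'(t) = \mathrm{tr}(S_u\Delta) - \mathrm{tr}((\Omega_0 + t\Delta)^{-1}\Delta)$, which implies

$$f'(0) = \mathrm{tr}((S_u - \Sigma_{u0})(\hat\Omega - \Omega_0)) = \mathrm{tr}(\Omega_0(S_u - \Sigma_{u0})\hat\Omega(\Sigma_{u0} - \hat\Sigma_u^{(2)}))$$

$$= \sum_{ij}(\Omega_0(S_u - \Sigma_{u0})\hat\Omega)_{ij}(\Sigma_{u0} - \hat\Sigma_u^{(2)})_{ij}.$$

Note that both $\|\Omega_0\|_1$ and $\|\hat\Omega\|_1$ are bounded from above for $\Sigma_{u0}, \hat\Sigma_u^{(2)} \in \Gamma$. By Lemma A.1(ii), $\max_{ij}|(\Omega_0(S_u - \Sigma_{u0})\hat\Omega)_{ij}| \le \max_{ij}|(S_u - \Sigma_{u0})_{ij}|\,\|\Omega_0\|_1\|\hat\Omega\|_1 = O_p(\sqrt{\log N/T})$. Therefore, $|f'(0)| = O_p(\sqrt{\log N/T})\sum_{ij}|\Sigma_{u0,ij} - \hat\Sigma_{u,ij}^{(2)}|$. In addition,

$$f''(t) = \mathrm{tr}((\Omega_0 + t\Delta)^{-1}\Delta(\Omega_0 + t\Delta)^{-1}\Delta) = \mathrm{vec}(\Delta)'[(\Omega_0 + t\Delta)^{-1} \otimes (\Omega_0 + t\Delta)^{-1}]\mathrm{vec}(\Delta),$$

where $\mathrm{vec}$ denotes the vectorization operator and $\otimes$ denotes the Kronecker product. Since both $(\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)})$ and $(\Lambda_0, \Sigma_{u0})$ are inside $\Theta_\lambda \times \Gamma$, $\sup_{0 \le t \le 1}\lambda_{\max}(t(\hat\Sigma_u^{(2)})^{-1} + (1 - t)\Sigma_{u0}^{-1})$ is bounded from above, which then implies that $\inf_{0 \le t \le 1}\lambda_{\min}[(\Omega_0 + t\Delta)^{-1}] = \inf_{0 \le t \le 1}\lambda_{\max}^{-1}(t(\hat\Sigma_u^{(2)})^{-1} + (1 - t)\Sigma_{u0}^{-1})$ is bounded below by a positive constant. Hence $\inf_{0 \le t \le 1}f''(t) \ge c\|\Delta\|_F^2$ for some $c > 0$. From (C.1) and $f(1) - f(0) \ge -|f'(0)| + c\|\Delta\|_F^2$, we have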



NQi(E( 2) ) > /^y^d^.d _ ^T^Wij|S«o,y| + c||A||| - O p (y °^ V " ) |E u0 ,ij - E^'l 
= A&r /J Wy|E U) y| + /xt ^ Wjj|E M| ij| - /x r 2jwjj|E M0) ^| +c||A||| 
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Since > S n o,?j| |^uo,y|) and J^j^j ify | ^uo,ij I = X/i^j,(ij)e5 tr ' lt '«jl^'wo,ij| + 

^(M)eSz, WijI^MCiil- It follows that 



iVQi(Ei 2) ) > /Ut ~ S «o,iil - Op(W^|^) |E u0 ,y - £ u ,y| + c||A||| 

> (jpt min Wij - O p (J^^-)) V |E UiiJ - - £ u(W | + c|| A||| 

—2 [I? u^ij | £«o, | — O p ( ^ ^ -) | S u o,ij — S u> y | 

-Ht max ty^ V" |£ u0iij - 

> -fir min iwy |E U) y-S u0 ,ij| + c||A|||,-2/i T max w i:j K T 

-O p (J^^)VN + D\\A\\ F - [It max ^A^VA 

which implies the desired result. □ 
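The derivative formulas for $f(t) = -\log|\Omega_0 + t\Delta| + \mathrm{tr}(S_u(\Omega_0 + t\Delta))$ used in this proof can be verified numerically. The sketch below is an illustrative check on small random matrices, not part of the original argument; the matrix sizes, the scaling of `Delta`, and all function names are assumptions. It compares the closed-form $f'$ and $f''$ with central finite differences and checks the lower bound $f''(t) \ge \lambda_{\min}((\Omega_0 + t\Delta)^{-1})^2\|\Delta\|_F^2$.

```python
import numpy as np

def f(t, Omega0, Delta, S):
    """f(t) = -log|Omega0 + t*Delta| + tr(S (Omega0 + t*Delta))."""
    M = Omega0 + t * Delta
    _, logdet = np.linalg.slogdet(M)   # M is positive definite in this example
    return -logdet + np.trace(S @ M)

def f_prime(t, Omega0, Delta, S):
    """f'(t) = tr(S Delta) - tr((Omega0 + t*Delta)^{-1} Delta)."""
    Minv = np.linalg.inv(Omega0 + t * Delta)
    return np.trace(S @ Delta) - np.trace(Minv @ Delta)

def f_second(t, Omega0, Delta):
    """f''(t) = tr(M^{-1} Delta M^{-1} Delta) with M = Omega0 + t*Delta."""
    Minv = np.linalg.inv(Omega0 + t * Delta)
    return np.trace(Minv @ Delta @ Minv @ Delta)

rng = np.random.default_rng(2)
N = 6
A = rng.standard_normal((N, N))
Omega0 = A @ A.T / N + np.eye(N)          # eigenvalues at least 1
B = rng.standard_normal((N, N))
Delta = 0.05 * (B + B.T)                  # small symmetric perturbation
C = rng.standard_normal((N, 3 * N))
S = C @ C.T / (3 * N)                     # stand-in for the sample covariance S_u

t, h = 0.4, 1e-5
fd_first = (f(t + h, Omega0, Delta, S) - f(t - h, Omega0, Delta, S)) / (2 * h)
fd_second = (f_prime(t + h, Omega0, Delta, S) - f_prime(t - h, Omega0, Delta, S)) / (2 * h)

Minv = np.linalg.inv(Omega0 + t * Delta)
lower_bound = np.linalg.eigvalsh(Minv).min() ** 2 * np.linalg.norm(Delta, "fro") ** 2
```

The lower bound follows from the Kronecker form of $f''(t)$, since the smallest eigenvalue of $A \otimes A$ is $\lambda_{\min}(A)^2$ for a positive definite $A$.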
Lemma C.3. 

T7 II s « - ) II f = Op ( — ( ht max toylfr + log iV + /4 max wJ-D 

DlogiV + 



p iVT V T 
Proof. Lemma C.2 implies

$$NQ_1(\hat\Sigma_u^{(2)}) \ge c\|\Delta\|_F^2 - 2\mu_T\max_{(i,j) \in S_L}w_{ij}K_T - \Big(O_p\Big(\sqrt{\frac{\log N}{T}}\Big)\sqrt{N + D} + \mu_T\max_{i \ne j, (i,j) \in S_U}w_{ij}\sqrt{D}\Big)\|\Delta\|_F.$$

Lemma C.1 gives $NQ_1(\hat\Sigma_u^{(2)}) \le O_p(\log N + N\sqrt{\log N/T})$. Hence we have



l|A||| = Q P ((V (iV + ^ )lQgiV +^.,max WlJ VD?) 

+O p (/i T max WijK T + log N + Ny/\og N/T) 

= p { ^ N + D ^ logN + ^2 max w^d + ht max w^K? + log A" + N \/\ogWjT) 
= OJ —^— + /4 max w^D + fMT max w^A^ + log N + Ny/log iV/T). 

Note that $\Sigma_{u0} - \hat\Sigma_u^{(2)} = \hat\Sigma_u^{(2)}\Delta\Sigma_{u0}$. Hence the desired result follows from $\|\hat\Sigma_u^{(2)}\| \le M$ wp1 and $\|\Sigma_{u0}\| \le M$.



□ 



Lemma C.4. iV 1 E(ij) e s L ~ ^uo,ij \ = o p {\). 
Proof. Lemma [C.2I implies 

min Y] |E Uiij - E u(W I < NQ^f,^) + 2/i T max iw y if T 



+ \O p ( 



l ^-)y/N + D + fi T max ^Vtf) ||A|| F . 



We have JVQi(eL 2) ) < O p (log JV + Ay log JV/T). By Lemma 



II a IIf = O p (J — + /it max w^VZ)) 



+O p {J/Pr max Wij AT r + v^giV + ^(^V 4 ). 



which implies the desired result under Assumption 4.2. 



Lemma C.5. A^A^E^)- 1 " S^)Ao = 0,(1). 



□ 



Proo/. Let A x = si 2) - S m0 , 5 = A'qE" 1 = (6,-,6v), and V = (E^)- 1 ^- Since the i x 
norms of (E^)" 1 and E^q 1 are bounded away from infinity, we have, sup^jy \\Vi\\ — O p (l) 
and sup^jv ||^|| = 0(1). Then 

^(S-o 1 - (Sl^-^Ao = ^HA^i + ^ E ^A lf « 






The first term on the right hand side is o p (l) by Lemma [C.41 and the second is bounded by 
TV— 1 II sL 2 ^ — E u0 1| V N + D (using Cauchy-Schwarz inequality), which is also o p (l) by Lemma 
IC.3I and Assumption 4.2. 

□ 

Lemma C.6. For $(\hat\Lambda, \hat\Sigma_u) = (\hat\Lambda^{(2)}, \hat\Sigma_u^{(2)})$, Lemma A.3 is satisfied.

Proof. We first show part (i) of Lemma IA.3I Since L C (A( 2 ), El 2 "*) < L c (A ,S n0 ), and 
<5i(E m0 ) = Q2(Ao,E n0 ) = 0, there is a nonnegative sequence d n = O p (N~ l \ogN + 
T-^ilogN) 1 ' 2 ) such that Qi(si 2) ) + Q 2 (A( 2 \ si 2) ) < d n . Lemma then implies < 
Q 2 (Ei 2) , A( 2 )) = o p (l). On the other hand, 

Q 2 (£i 2 ),A( 2 )) = itr [A^r^-A^^ 

The matrix in the bracket is positive semidefinite. Hence

^A' (E( 2 ))- 1 A - (J r - J)^A'(Si 2 ))- 1 A( 2 )(/ r - J)' = 0,(1). (C.2) 

Finally, the desired result follows from Lemma IC.5I 

The first order condition in part (ii) is easy to derive and is the same as that in Bai and 
Li (2012). 

□ 

Proof of Theorem 4.1 

$N^{-1}\|\hat\Sigma_u^{(2)} - \Sigma_{u0}\|_F^2 = o_p(1)$ follows from Lemma C.3 and Assumption 4.2. On the other hand, equation (C.2) also implies

1(A( 2 ) - Ao)^ 1 ^ - A ) - = 0,(1). 

By Lemma \AM N^JH^J' = o p (l). Hence AT -1 (A^ - A )'E- 1 (A - A ) = o p (l), which 
implies the consistency iV — 1 1| A — A || 2 = o p (l) because the eigenvalues of E" 1 are bounded 
away from zero. Q.E.D. 

To prove the consistency of $\hat f_t^{(2)}$, we note that the expansion (B.19) still holds for $\hat f_t^{(2)}$. Since $J = o_p(1)$ by Lemma A.5 and $\bar u$ is of smaller order than $u_t$ for each fixed $t$, we have $\hat f_t^{(2)} - f_t = O_p(N^{-1})\hat\Lambda^{(2)\prime}(\hat\Sigma_u^{(2)})^{-1}u_t + o_p(1)$. Moreover, since $\|(\hat\Sigma_u^{(2)})^{-1}\|$ and $\|\hat\Sigma_u^{(2)}\|$ are both $O_p(1)$ and $\|\hat\Lambda^{(2)}\|_F = O_p(\sqrt{N})$ by the restriction of the parameter space $\Theta_\lambda \times \Gamma$, we have $N^{-1/2}\|\hat\Lambda^{(2)\prime}(\hat\Sigma_u^{(2)})^{-1} - \Lambda_0'\Sigma_{u0}^{-1}\|_F = O_p(N^{-1/2}\|\hat\Lambda^{(2)} - \Lambda_0\|_F + N^{-1/2}\|\hat\Sigma_u^{(2)} - \Sigma_{u0}\|_F)$, which is $o_p(1)$ as proved above. Therefore, since $N^{-1}\Lambda_0'\Sigma_{u0}^{-1}u_t = N^{-1}\sum_{i=1}^N\xi_iu_{it} = O_p(N^{-1/2})$,

$$\hat f_t^{(2)} - f_t = O_p(N^{-1})\Lambda_0'\Sigma_{u0}^{-1}u_t + o_p(1) = o_p(1).$$

C.2 Proof of Theorem 4.2 

We now verify Assumption 4.2 for the adaptive lasso.

Lemma C.7. For the adaptive lasso,

(i) $\min_{i \ne j, (i,j) \in S_U}|\Sigma_{u0,ij}|^\gamma\max_{i \ne j, (i,j) \in S_U}w_{ij} = O_p(1)$.

(ii) $\delta_T^\gamma\max_{(i,j) \in S_L}w_{ij} = O_p(1)$.

(iii) $\omega_T^\gamma(\min_{(i,j) \in S_L}w_{ij})^{-1} = O_p(1)$ (recall that $\omega_T = N^{-1/2} + T^{-1/2}(\log N)^{1/2}$).

Proof. By Lemma B.5, $\max_{i \le N, j \le N}|\hat\Sigma_{u,ij}^* - \Sigma_{u0,ij}| = O_p(\omega_T)$. Given this result and the assumption that $\min_{(i,j) \in S_U}|\Sigma_{u0,ij}| \gg \omega_T$, we have result (i). For any $(i,j) \in S_L$, the following inequality holds: $\delta_T^\gamma \le w_{ij}^{-1} \le (|\Sigma_{u0,ij}| + |\Sigma_{u0,ij} - \hat\Sigma_{u,ij}^*| + \delta_T)^\gamma$, which then implies results (ii) and (iii), due to the assumptions that $\delta_T = o(\omega_T)$ and $\Sigma_{u0,ij} = O(\omega_T)$. □
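The bounds in this proof are consistent with adaptive lasso weights of the form $w_{ij} = (|\hat\Sigma^*_{u,ij}| + \delta_T)^{-\gamma}$, where $\hat\Sigma^*_u$ is a preliminary covariance estimate. The sketch below assumes that form (it is inferred from the inequality above rather than quoted from the weight definition), and the function name and toy inputs are hypothetical. It shows the mechanism the lemma exploits: entries with a large preliminary estimate receive small weights, while near-zero entries receive weights close to the cap $\delta_T^{-\gamma}$.

```python
import numpy as np

def adaptive_lasso_weights(Sigma_init, delta_T, gamma):
    """Assumed weight form w_ij = (|Sigma_init_ij| + delta_T)^(-gamma); diagonal is unpenalized."""
    W = (np.abs(np.asarray(Sigma_init, dtype=float)) + delta_T) ** (-gamma)
    np.fill_diagonal(W, 0.0)
    return W

# Toy preliminary estimate: one large, one zero, and one tiny off-diagonal entry.
Sigma_init = np.array([
    [1.0, 0.5,   0.0],
    [0.5, 1.0,   0.001],
    [0.0, 0.001, 1.0],
])
delta_T, gamma = 0.01, 1.0
W = adaptive_lasso_weights(Sigma_init, delta_T, gamma)
```

The additive $\delta_T$ keeps the weights finite when the preliminary estimate is exactly zero, which is what delivers the upper bound $w_{ij} \le \delta_T^{-\gamma}$ in part (ii).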

Proof of Assumption 4.2 for Adaptive Lasso 

It follows from the previous lemma that o>t = O^cu^min^^ ^s v l^wo.yl) -7 ) = °p(l)> 
and /3t = O p ((ujt/ 5tV)- By the assumption that Z) = 0(N), 



, , T N f T [N N I f / T \ 1/4 / JV 

C = min<*/- — — , \ — , -====>» nun 



log TV Zr \logNj V D V-D log AM V^g^/ V lo 8 N 



Hence ax = O p (Q. This together with the lower bound assumption on 5t yields Assumption 
4.2 (i). 

For part (ii), note that ax = o p (l) implies that with probability approaching one, 



N 2 N 2 A7 r N jN N [W 

, Ctrr \ = N, H1UH ,\ , Ct^ \ = \ . 

D D ' D \ D D V D 



By Lemma IC.7( ii). (recall that Kt = J2(ij)^s L l^«o,yl) anc ^ the lower bound 5t 3> 
u T {K T /N) lh ', /i T max (!j)6Si wyKr = O p (n T 5pK T ) = o p (N). 

By Lemma IC.7( i) and the assumptions that D = O(N) and min^ ^gs^ |S M0 ,ij| 3> out, 
we have /i T max^- (iii)e5l7 u^- = O p /i T (min^- ( ij)e5l7 |S u0 ,ij | 7 ) -1 = o p (y/N/D), due to the 
upper bound on /j, T = o(ui T ). Finally, by Lemma ICTfT iii) and the assumption that /j,t 3> u) T +1 , 

we have ht m ^ n (ij)es L w ij ^ 

Proof of Assumption 4.2 for SCAD 
Since $\mu_T/\min_{i \ne j, (i,j) \in S_U}|R_{ij}| = o_p(1)$ and $\max_{(i,j) \in S_L}|R_{ij}| = o_p(\mu_T)$, it is easy to verify that with probability approaching one, $\max_{i \ne j, (i,j) \in S_U}w_{ij} = 0$ and $\min_{(i,j) \in S_L}w_{ij} = \max_{(i,j) \in S_L}w_{ij} = \mu_T$. Hence $\alpha_T = 0$ and $\beta_T = 1$. This immediately implies the desired result.